Abstract



In this paper, we propose a unifying view of several recently proposed structured
sparsity-inducing norms. We consider the situation of a model simultaneously (a) penalized by a
set-function defined on the support of the unknown parameter vector, which represents prior
knowledge on supports, and (b) regularized in $\ell_p$-norm. We show that the natural combinatorial
optimization problems obtained may be relaxed into convex optimization problems and introduce
a notion, the lower combinatorial envelope of a set-function, that characterizes the tightness
of our relaxations. We moreover establish links with norms based on latent representations,
including the latent group Lasso and block-coding, and with norms obtained from submodular
functions.



1 Introduction 

Recent years have seen the emergence of the field of structured sparsity, which aims at identifying
a model of small complexity given a priori knowledge on its possible structure.

Various regularizations, in particular convex ones, have been proposed that formalize the notion that
prior information can be expressed through functions encoding the set of possible or encouraged
supports¹ in the model. Several convex regularizers for structured sparsity arose as generalizations
of the group Lasso (Yuan and Lin, 2006) to the case of overlapping groups (Jenatton et al., 2011a; 
Jacob et al., 2009; Mairal et al., 2011), in particular to tree-structured groups (Zhao et al., 2009; 
Kim and Xing, 2010; Jenatton et al., 2011b). Other formulations have been considered based on 
variational formulations (Micchelli et al., 2011), the perspective of multiple kernel learning (Bach 
et al., 2012), submodular functions (Bach, 2010) and norms defined as convex hulls (Obozinski 
et al., 2011; Chandrasekaran et al., 2010). Non-convex approaches include He and Carin (2009),
Baraniuk et al. (2010) and Huang et al. (2011). We refer the reader to Huang et al. (2011) for a concise
overview and discussion of the related literature and to Bach et al. (2012) for a more detailed tutorial 
presentation. 

In this context, and given a model parametrized by a vector of coefficients $w \in \mathbb{R}^V$ with
$V = \{1,\ldots,d\}$, the main objective of this paper is to find an appropriate way to combine
combinatorial penalties, which control the structure of a model in terms of the sets of variables
allowed or favored to enter the function learned, with continuous regularizers, such as $\ell_p$-norms,
which control the magnitude of their coefficients, into a convex regularization that would control both.

¹ By support, we mean the set of indices of non-zero parameters.

Part of our motivation stems from previous work on regularizers that "convexify" combinatorial
penalties. Bach (2010) proposes to consider the tightest convex relaxation of the restriction of a
submodular penalty to a unit $\ell_\infty$-ball in the space of model parameters $w \in \mathbb{R}^d$. However, this
relaxation scheme implicitly assumes that the coefficients are in a unit $\ell_\infty$-ball; then, the relaxation
obtained induces clustering artifacts of the values of the learned vector. It would thus seem desirable
to propose relaxation schemes that do not assume that coefficients are bounded but rather control
their magnitude continuously, and to find alternatives to the $\ell_\infty$-norm. Finally, the class of functions
considered is restricted to submodular functions.

In this paper, we therefore consider combined penalties of the form mentioned above and propose 
first an appropriate convex relaxation in Section 2; the properties of general combinatorial functions 
preserved by the relaxation are captured by the notion of lower combinatorial envelope introduced 
in Section 2.2. Section 3 relates the convex regularization obtained to the latent group Lasso and 
to set-cover penalties, while Section 4 provides additional examples, such as the exclusive Lasso. 
We discuss in more detail the case of submodular functions in Section 6 and propose for that case
efficient algorithms and a theoretical analysis. Finally, we present some experiments in Section 7. 

Yet another motivation is to follow loosely the principle of two-part or multiple-part codes from
MDL theory (Rissanen, 1978). In particular, if the model is parametrized by a vector of parameters
w, it is possible to encode (an approximation of) w itself with a two-part code, by encoding first
the support Supp(w) of w (its set of non-zero values) with a code length of $F(\mathrm{Supp}(w))$ and by
encoding the actual values of w using a code based on a log prior distribution on the vector w that
could motivate the choice of an $\ell_p$-norm as a surrogate for the code length. This leads naturally to
consider penalties of the form $\mu F(\mathrm{Supp}(w)) + \nu \|w\|_p^p$ and to find appropriate notions of relaxation.

Notations. When indexing vectors of $\mathbb{R}^d$ with a set A or B in exponent, $x^A$ and $x^B \in \mathbb{R}^d$ refer to
two a priori unrelated vectors; by contrast, when using A as an index, and given a vector $x \in \mathbb{R}^d$,
$x_A$ denotes the vector of $\mathbb{R}^d$ such that $[x_A]_i = x_i$ for $i \in A$ and $[x_A]_i = 0$ for $i \notin A$. If s is a
vector in $\mathbb{R}^d$, we use the shorthand $s(A) := \sum_{i \in A} s_i$, and $|s|$ denotes the vector whose elements
are the absolute values $|s_i|$ of the elements $s_i$ in s. For $p > 1$, we define q through the relation
$\frac{1}{p} + \frac{1}{q} = 1$. The $\ell_q$-norm of a vector w will be denoted $\|w\|_q = (\sum_i |w_i|^q)^{1/q}$. For a function
$f : \mathbb{R}^d \to \mathbb{R}$, we will denote by $f^*$ its Fenchel-Legendre conjugate. We will write $\overline{\mathbb{R}}_+$ for
$\mathbb{R}_+ \cup \{+\infty\}$.

2 Penalties and convex relaxations 

Let $V = \{1,\ldots,d\}$ and $2^V = \{A \mid A \subset V\}$ its power-set. We will consider positive-valued
set-functions of the form $F : 2^V \to \overline{\mathbb{R}}_+$ such that $F(\emptyset) = 0$ and $F(A) > 0$ for all $A \neq \emptyset$. We do not
necessarily assume that F is non-decreasing, even if it would a priori be natural for a penalty function
of the support. We however assume that the domain of F, defined as $\mathcal{D}_0 := \{A \mid F(A) < \infty\}$, covers
V, i.e., satisfies $\bigcup_{A \in \mathcal{D}_0} A = V$ (if F is non-decreasing, this just implies that it should be finite on
singletons). We will denote by $\iota_{\{x \in S\}}$ the indicator function of the set S, taking value 0 on the set
and $+\infty$ outside. We will write $[\![k_1, k_2]\!]$ to denote the discrete interval $\{k_1, \ldots, k_2\}$.

With the motivations of the previous section, and denoting by Supp(w) the set of non-zero coefficients
of a vector w, we consider a penalty involving both a combinatorial function F and $\ell_p$-regularization:

$$\mathrm{pen} : w \mapsto \mu F(\mathrm{Supp}(w)) + \nu \|w\|_p^p, \tag{1}$$

where $\mu$ and $\nu$ are positive scalar coefficients. Since such non-convex discontinuous penalizations are
computationally intractable, we undertake to construct an appropriate convex relaxation. The most
natural convex surrogate for a non-convex function, say A, is arguably its convex envelope (i.e., its
tightest convex lower bound) which can be computed as its Fenchel-Legendre bidual $A^{**}$. However,
one relatively natural requirement for a regularizer is to ask that it be also positively homogeneous
(p.h.) since this leads to formulations that are invariant by rescaling of the data. Our goal will
therefore be to construct the tightest positively homogeneous convex lower bound of the penalty
considered.

Now, it is a classical result that, given a function A, its tightest p.h. (but not necessarily convex)
lower bound $A_h$ is $A_h(w) = \inf_{\lambda > 0} \frac{A(\lambda w)}{\lambda}$ (see Rockafellar, 1970, p. 35).

This is instrumental here given the following proposition: 

Proposition 1. Let $A : \mathbb{R}^d \to \overline{\mathbb{R}}_+$ be a real valued function, $A_h$ defined as above. Then C, the
tightest positively homogeneous and convex lower bound of A, is well-defined and $C = A_h^{**}$.

Proof. The set of convex p.h. lower bounds of A is non-empty (since it contains the constant zero
function) and stable by taking pointwise maxima. Therefore it has a unique majorant, which we call
C. We have for all $w \in \mathbb{R}^d$, $A_h^{**}(w) \leq C(w) \leq A(w)$, by definition of C and the fact that $A_h^{**}$ is a
convex p.h. lower bound on A. We thus have for all $\lambda > 0$, $A_h^{**}(\lambda w)\lambda^{-1} \leq C(\lambda w)\lambda^{-1} \leq A(\lambda w)\lambda^{-1}$, which
implies that for all $w \in \mathbb{R}^d$, $A_h^{**}(w) \leq C(w) \leq A_h(w)$. Since C is convex, we must have $C = A_h^{**}$,
hence the desired result. □



Using its definition we can easily compute the tightest positively homogeneous lower bound of the
penalization of Eq. (1), which we denote $\mathrm{pen}_h$:

$$\mathrm{pen}_h(w) = \inf_{\lambda > 0} \frac{\mu}{\lambda} F(\mathrm{Supp}(w)) + \nu \lambda^{p-1} \|w\|_p^p.$$

Setting the gradient of the objective to 0, one gets that the minimum is obtained for
$\lambda = \big(\frac{\mu q}{p \nu}\big)^{1/p} F(\mathrm{Supp}(w))^{1/p} \|w\|_p^{-1}$ and that

$$\mathrm{pen}_h(w) = (q\mu)^{1/q} (p\nu)^{1/p}\, \Theta(w),$$

where we introduced the notation

$$\Theta(w) := F(\mathrm{Supp}(w))^{1/q} \|w\|_p.$$
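The closed form for the positively homogeneous lower bound can be sanity-checked numerically. The sketch below is only an illustration under assumed values (the function names, the test vector and the constants $\mu$, $\nu$, $p$ are hypothetical, not taken from the paper): it approximates the infimum over $\lambda$ by a log-spaced grid search and compares it with $(q\mu)^{1/q}(p\nu)^{1/p} F(\mathrm{Supp}(w))^{1/q}\|w\|_p$.

```python
import math

def pen_h_numeric(F_val, w, mu, nu, p, grid=20000):
    """Approximate inf over lam > 0 of pen(lam * w) / lam on a log-spaced grid."""
    norm_pp = sum(abs(x) ** p for x in w)  # ||w||_p^p
    best = math.inf
    for k in range(grid):
        lam = 10 ** (-4 + 8 * k / (grid - 1))  # lam ranges over [1e-4, 1e4]
        best = min(best, mu * F_val / lam + nu * lam ** (p - 1) * norm_pp)
    return best

def pen_h_closed(F_val, w, mu, nu, p):
    """Closed form (q mu)^(1/q) (p nu)^(1/p) F^(1/q) ||w||_p with 1/p + 1/q = 1."""
    q = p / (p - 1)
    norm_p = sum(abs(x) ** p for x in w) ** (1 / p)
    return (q * mu) ** (1 / q) * (p * nu) ** (1 / p) * F_val ** (1 / q) * norm_p

# Hypothetical instance: F(Supp(w)) = 1.8, w = (0.5, -2), mu = 1, nu = 0.7, p = 3.
numeric = pen_h_numeric(1.8, (0.5, -2.0), 1.0, 0.7, 3)
closed = pen_h_closed(1.8, (0.5, -2.0), 1.0, 0.7, 3)
```

Since the grid only samples values of $\lambda$, the numerical value can only overestimate the true infimum, which is why the two numbers agree up to a small positive gap.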

Up to a constant factor depending on the choices of $\mu$ and $\nu$, we are therefore led to consider the
positively homogeneous penalty we just defined, which combines the two terms multiplicatively.
Consider the norm $\Omega_p$ (or $\Omega_p^F$ if a reference to F is needed) whose dual norm² is defined as

$$\Omega_p^*(s) := \max_{A \subset V,\, A \neq \emptyset} \frac{\|s_A\|_q}{F(A)^{1/q}}. \tag{2}$$

We have the following result:

Proposition 2 (Convex relaxation). The norm $\Omega_p$ is the convex envelope of $\Theta$.

Proof. Denote $\Theta(w) = \|w\|_p F(\mathrm{Supp}(w))^{1/q}$, and compute its Fenchel conjugate:

$$\begin{aligned}
\Theta^*(s) &= \max_{w \in \mathbb{R}^d} \; w^\top s - \|w\|_p F(\mathrm{Supp}(w))^{1/q} \\
&= \max_{A \subset V} \; \max_{w:\, \mathrm{Supp}(w) = A} \; w_A^\top s_A - \|w_A\|_p F(A)^{1/q} \\
&= \max_{A \subset V} \; \iota_{\{\|s_A\|_q \leq F(A)^{1/q}\}} = \iota_{\{\Omega_p^*(s) \leq 1\}},
\end{aligned}$$

where $\iota_{\{s \in S\}}$ is the indicator of the set S, that is the function equal to 0 on S and $+\infty$ on $S^c$. The
Fenchel bidual of $\Theta$, i.e., its largest (thus tightest) convex lower bound, is therefore exactly $\Omega_p$. □

² The assumptions on the domain $\mathcal{D}_0$ of F and on the positivity of F indeed guarantee that $\Omega_p^*$ is a norm.

Figure 1: Penalties in 2D. From left to right: the graph of the penalty pen, the graph of the penalty
$\mathrm{pen}_h$ with p = 2, and the graph of the norm $\Omega_p$ in blue overlaid over the graph of $\mathrm{pen}_h$, for the
combinatorial function $F : 2^V \to \mathbb{R}_+$, with $F(\emptyset) = 0$, $F(\{1\}) = F(\{2\}) = 1$ and $F(\{1,2\}) = 1.8$.

Note that the function F is not assumed submodular in the previous result. Since the function $\Theta$
depends on w only through $|w|$, by symmetry, the norm $\Omega_p$ is also a function of $|w|$. Given
Proposition 1, we have the immediate corollary:

Corollary 1 (Two parts-code relaxation). Let p > 1. The norm $w \mapsto (q\mu)^{1/q} (p\nu)^{1/p}\, \Omega_p(w)$ is the
tightest convex positively homogeneous lower bound of the function $w \mapsto \mu F(\mathrm{Supp}(w)) + \nu \|w\|_p^p$.

The penalties and relaxation results considered in this section are illustrated in Figure 1.

2.1 Special cases.

Case p = 1. In that case, letting $d_k = \max_{A \ni k} F(A)$, the dual norm is $\Omega_1^*(s) = \max_{k \in V} |s_k|/d_k$ so
that $\Omega_1(w) = \sum_{k \in V} d_k |w_k|$ is always a weighted $\ell_1$-norm. But regularizing with a weighted $\ell_1$-norm
leads to estimators that can potentially have all sparsity patterns possible (even if some are obviously
privileged) and in that sense a weighted $\ell_1$-norm cannot encode hard structural constraints on the
patterns. Since this means in other words that the $\ell_1$-relaxations essentially lose the combinatorial
structure of allowed sparsity patterns possibly encoded in F, we focus, from now on, on the case
p > 1.

Lasso, group Lasso. $\Omega_p$ instantiates as the $\ell_1$, $\ell_p$ and $\ell_1/\ell_p$-norms for the simplest functions:

• If $F(A) = |A|$, then $\Omega_p(w) = \|w\|_1$, since $\Omega_p^*(s) = \max_A \frac{\|s_A\|_q}{|A|^{1/q}} = \|s\|_\infty$. It is interesting that
the cardinality function is always relaxed to the $\ell_1$-norm for all $\ell_p$-relaxations, and is not an
artifact of the traditional relaxation on an $\ell_\infty$-ball.

• If $F(A) = 1_{\{A \neq \emptyset\}}$, then $\Omega_p(w) = \|w\|_p$, since $\Omega_p^*(s) = \max_A \|s_A\|_q = \|s\|_q$.

• If $F(A) = \sum_{j=1}^g 1_{\{A \cap G_j \neq \emptyset\}}$, for $(G_j)_{j \in \{1,\ldots,g\}}$ a partition of V, then $\Omega_p(w) = \sum_{j=1}^g \|w_{G_j}\|_p$
is the group Lasso or $\ell_1/\ell_p$-norm (Yuan and Lin, 2006). This result provides a principled
derivation for the form of these norms, which did not exist in the literature. For groups which
do not form a partition, this identity does in fact not hold in general for $p < \infty$, as we discuss
in Section 4.
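
The three special cases above can be checked on a small instance by evaluating the dual norm of Eq. (2) by brute force over all non-empty subsets. This is a minimal sketch for illustration only: the helper `dual_norm`, the test vector and the two-group partition are assumptions introduced here, and the enumeration is exponential in d, so it is only meant for tiny examples.

```python
import math
from itertools import combinations

def dual_norm(s, F, q):
    """Brute-force max over non-empty A of ||s_A||_q / F(A)^(1/q), as in Eq. (2)."""
    d = len(s)
    best = 0.0
    for k in range(1, d + 1):
        for A in combinations(range(d), k):
            val = F(frozenset(A))
            if val == float("inf"):
                continue  # sets outside the domain of F impose no constraint
            norm_q = sum(abs(s[i]) ** q for i in A) ** (1.0 / q)
            best = max(best, norm_q / val ** (1.0 / q))
    return best

s = [3.0, -1.0, 2.0]
q = 2.0  # corresponds to p = 2
groups = [frozenset({0, 1}), frozenset({2})]  # hypothetical partition of {0, 1, 2}

card = dual_norm(s, len, q)                     # F(A) = |A|
const = dual_norm(s, lambda A: 1, q)            # F(A) = 1 for non-empty A
group = dual_norm(s, lambda A: sum(1 for G in groups if A & G), q)
```

On this instance the three values coincide with $\|s\|_\infty$, $\|s\|_q$ and $\max_j \|s_{G_j}\|_q$, which are the dual norms of the $\ell_1$, $\ell_p$ and $\ell_1/\ell_p$ norms respectively.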

Submodular functions and p = ∞. For a submodular function F and in the p = ∞ case, the
norm that we derived actually coincides with the relaxation proposed by Bach (2010), and as
shown in that work, $\Omega_\infty^F(w) = f(|w|)$, where f is a function associated with F and called the Lovász
extension of F. We discuss the case of submodular functions in detail in Section 6.



2.2 Lower combinatorial envelope 

The fact that when F is a submodular function, $\Omega_\infty^F$ is equal to the Lovász extension f on the
positive orthant provides a guarantee on the tightness of the relaxation. Indeed f is called an
"extension" because $\forall A \in 2^V$, $f(1_A) = F(A)$, so that f can be seen to extend the function F to
$\mathbb{R}^d$; as a consequence, $\Omega_\infty^F(1_A) = f(1_A) = F(A)$, which means that the relaxation is tight for all w
of the form $w = c\,1_A$, for any scalar constant $c \in \mathbb{R}$ and any set $A \subset V$. If F is not submodular,
this property does not necessarily hold, thereby suggesting that the relaxation could be less tight in
general. To characterize to what extent this is true, we introduce a couple of new concepts.

Many of the properties of $\Omega_p^F$, for any p > 1, are captured by the unit ball of $(\Omega_\infty^F)^*$ or its intersection
with the positive orthant. In fact, as we will see in the sequel, the $\ell_\infty$ relaxation plays a particular
role, to establish properties of the norm, to construct algorithms and for the statistical analysis,
since it reflects most directly the combinatorial structure of the function F.

We define the canonical polyhedron³ associated to the combinatorial function as the polyhedron $\mathcal{P}_F$
defined by

$$\mathcal{P}_F = \{ s \in \mathbb{R}^d_+,\ \forall A \subset V,\ s(A) \leq F(A) \}.$$

By construction, it is immediate that the unit ball of $(\Omega_\infty^F)^*$ is $\{ s \in \mathbb{R}^d \mid |s| \in \mathcal{P}_F \}$.

From this polyhedron, we construct a new set-function which restores the features of F that are
captured by $\mathcal{P}_F$.

Definition 2 (Lower combinatorial envelope). Define the lower combinatorial envelope (LCE) of
F as the set-function $F_-$ defined by:

$$F_-(A) = \max_{s \in \mathcal{P}_F} s(A).$$

By construction, even when F is not monotonic, $F_-$ is always non-decreasing (because $\mathcal{P}_F \subset \mathbb{R}^d_+$).

One of the key properties of the lower combinatorial envelope is that, as shown in the next lemma,
$\Omega_\infty^F$ is an extension of $F_-$ in the same way that the Lovász extension is an extension of F when F
is submodular.

Lemma 1. (Extension property) $\Omega_\infty^F(1_A) = F_-(A)$.

Proof. From the definition of $\Omega_\infty^F$, $\mathcal{P}_F$ and $F_-$, we get:
$\Omega_\infty^F(1_A) = \max_{(\Omega_\infty^F)^*(s) \leq 1} 1_A^\top s = \max_{s \in \mathcal{P}_F} s^\top 1_A = F_-(A)$. □

Functions that are close to their LCE have in that sense a tighter relaxation than others. 

A second important property is that a function F and its LCE share the same canonical polyhedron.
This will result as an immediate corollary from the following lemma:

Lemma 2. $\forall s \in \mathbb{R}^d_+$, $\max_{A \subset V} \frac{s(A)}{F(A)} = \max_{A \subset V} \frac{s(A)}{F_-(A)}$.

Proof. Given that for all A, $F_-(A) \leq F(A)$, the left hand side is always smaller than or equal to the
right hand side. We now reason by contradiction. Assume that there exist s and $\nu > 0$ such that
$\forall A \subset V$, $s(A) \leq \nu F(A)$ but that there exists $B \subset V$ with $s(B) > \nu F_-(B)$; then $s' = \frac{1}{\nu} s$ satisfies
$\forall A \subset V$, $s'(A) \leq F(A)$. By definition of $F_-$, the latter implies that
$F_-(B) \geq s'(B) = \frac{1}{\nu} s(B) > \frac{1}{\nu} \cdot \nu F_-(B)$, where the last inequality results from the choice of this
particular B. This would imply $F_-(B) > F_-(B)$, which is a contradiction. □

³ The reader familiar with submodular functions will recognize that the canonical polyhedron generalizes the
submodular polyhedron usually defined for these functions.

Figure 2: Intersection of the canonical polyhedron with the positive orthant for three different
functions F. Full lines materialize the inequalities $s(A) \leq F(A)$ that define the polyhedron. Dashed
lines materialize the induced constraints $s(A) \leq F_-(A)$ that result from all constraints $s(B) \leq F(B)$,
$B \in 2^V$. From left to right: (i) $\mathcal{D}_F = 2^V$ and $F_- = F = F_+$; (ii) $\mathcal{D}_F = \{\{2\}, \{1,2\}\}$ and
$F_-(\{1\}) < F(\{1\})$; (iii) $\mathcal{D}_F = \{\{1\}, \{2\}\}$ corresponding to a weighted $\ell_1$-norm.

Corollary 3. $\mathcal{P}_F = \mathcal{P}_{F_-}$.

But the sets $\{ w \in \mathbb{R}^d \mid |w| \in \mathcal{P}_F \}$ and $\{ w \in \mathbb{R}^d \mid |w| \in \mathcal{P}_{F_-} \}$ are respectively the unit balls of
$(\Omega_\infty^F)^*$ and $(\Omega_\infty^{F_-})^*$. As a direct consequence, we have:

Lemma 3. For all p > 1, $\Omega_p^F = \Omega_p^{F_-}$.

By construction, $F_-$ is the smallest function which lower bounds F and has the same $\ell_p$-relaxation
as F, hence the term of lower combinatorial envelope.

Figure 2 illustrates the fact that F and $F_-$ share the same canonical polyhedron and that the value
of $F_-(A)$ is determined by the values that F takes on other sets. This figure also suggests that
some constraints $\{ s(A) \leq F(A) \}$ can never be active and could therefore be removed. This will be
formalized in Section 2.3.

To illustrate the relevance of the concept of lower combinatorial envelope, we compute it for a specific
combinatorial function, the range function, and show that it enables us to answer the question of
whether the relaxation would be good in this case.

Example 1 (Range function). Consider, on $V = [\![1, d]\!]$, the range function $F : A \mapsto \max(A) - \min(A) + 1$
where min(A) (resp. max(A)) is the smallest (resp. largest) element in A. A motivation
to consider this function is that it induces the selection of supports that are exactly intervals. Since
$F(\{i\}) = 1$, $i \in V$, then for all $s \in \mathcal{P}_F$, we have $s(A) \leq |A| \leq F(A)$. But this implies that $\mathcal{D}_F$ is the
set of singletons and that $F_-(A) = |A|$, so that $\Omega_p^F$ is the $\ell_1$-norm and is oblivious of the structure
encoded in F.

As we see from this example, the lower combinatorial envelope can be interpreted as the combina- 
torial function which the relaxation is actually able to capture. 
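
The range-function example can be reproduced numerically for a small d. The sketch below is an assumption-laden illustration rather than a method from the paper: it computes the lower combinatorial envelope by enumerating the vertices of the canonical polyhedron with exact rational arithmetic (the helper names `solve_square` and `lower_envelope` are hypothetical), which is only practical for very small ground sets. Indices are 0-based, which leaves the range function unchanged.

```python
from fractions import Fraction
from itertools import combinations

def solve_square(rows, rhs):
    """Exact Gauss-Jordan solve of a square linear system; None if singular."""
    n = len(rows)
    M = [[Fraction(x) for x in row] + [Fraction(b)] for row, b in zip(rows, rhs)]
    for col in range(n):
        piv = next((r for r in range(col, n) if M[r][col] != 0), None)
        if piv is None:
            return None
        M[col], M[piv] = M[piv], M[col]
        pivot = M[col][col]
        M[col] = [x / pivot for x in M[col]]
        for r in range(n):
            if r != col and M[r][col] != 0:
                f = M[r][col]
                M[r] = [a - f * b for a, b in zip(M[r], M[col])]
    return [M[i][n] for i in range(n)]

def lower_envelope(F, d):
    """F_-(A) = max of s(A) over the polyhedron {s >= 0, s(B) <= F(B)}.

    F maps frozensets to finite costs; subsets missing from F cost +infinity.
    The maximum of a linear function is attained at a vertex, so we enumerate
    all vertices (d linearly independent active constraints, feasible point).
    """
    upper = [([1 if i in B else 0 for i in range(d)], Fraction(c))
             for B, c in F.items() if B]
    lower = [([1 if i == j else 0 for i in range(d)], Fraction(0)) for j in range(d)]
    verts = []
    for active in combinations(upper + lower, d):
        s = solve_square([a for a, _ in active], [b for _, b in active])
        if s is None:
            continue
        if all(x >= 0 for x in s) and all(
                sum(ai * si for ai, si in zip(a, s)) <= b for a, b in upper):
            verts.append(s)
    subsets = [frozenset(A) for k in range(1, d + 1)
               for A in combinations(range(d), k)]
    return {A: max(sum(s[i] for i in A) for s in verts) for A in subsets}

d = 4
range_fn = {frozenset(A): max(A) - min(A) + 1
            for k in range(1, d + 1) for A in combinations(range(d), k)}
lce = lower_envelope(range_fn, d)
```

For instance, the set {0, 3} has range 4 but a lower combinatorial envelope of 2, its cardinality, in agreement with the claim that the relaxation only sees $|A|$.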

2.3 Upper combinatorial envelope

Let F be a set-function and $\mathcal{P}_F$ its canonical polyhedron. In this section, we follow an intuition
conveyed by Figure 2 and find a compact representation of F: the polyhedron $\mathcal{P}_F$ has in many cases
a number of faces which is much smaller than $2^d$. We formalize this in the next lemma.

Lemma 4. There exists a minimal subset $\mathcal{D}_F$ of $2^V$ such that for $s \in \mathbb{R}^d_+$,

$$s \in \mathcal{P}_F \Leftrightarrow (\forall A \in \mathcal{D}_F,\ s(A) \leq F(A)).$$

Proof. To prove the result, we define as in Obozinski et al. (2011, Sec. 8.1) the notion of redundant
sets: we say that a set A is redundant for F if

$$\exists A_1, \ldots, A_k \in 2^V \setminus \{A\}, \quad (\forall i,\ s(A_i) \leq F(A_i)) \Rightarrow (s(A) \leq F(A)).$$

Consider the set $\mathcal{D}_F$ of all non-redundant sets.

We will show that, in fact, A is redundant for F if and only if

$$\exists A_1, \ldots, A_k \in \mathcal{D}_F \setminus \{A\}, \quad (\forall i,\ s(A_i) \leq F(A_i)) \Rightarrow (s(A) \leq F(A)),$$

which proves the lemma.

Indeed, we can use a peeling argument to remove all redundant sets one by one and show recursively
that the inequality constraint associated with a given redundant set is still implied by all the ones we
have not removed yet. The procedure stops when we have reached the smallest set $\mathcal{D}_F$ of constraints
implying all the other ones. □

We call $\mathcal{D}_F$ the core set of F. It corresponds to the set of faces of dimension d − 1 of $\mathcal{P}_F$.
This notion motivates the definition of a new set-function:

Definition 4. (Upper combinatorial envelope) We call upper combinatorial envelope (UCE) the
function $F_+$ defined by $F_+(A) = F(A)$ for $A \in \mathcal{D}_F$ and $F_+(A) = \infty$ otherwise.

As the reader might expect at this point, $F_+$ provides a compact representation which captures all
the information about F that is preserved in the relaxation:

Proposition 3. F, $F_-$ and $F_+$ all define the same canonical polyhedron $\mathcal{P}_{F_-} = \mathcal{P}_F = \mathcal{P}_{F_+}$ and share
the same core set $\mathcal{D}_F$. Moreover, $\forall A \in \mathcal{D}_F$, $F_-(A) = F(A) = F_+(A)$.

Proof. To show that $\Omega_p^{F_+} = \Omega_p^F$ we just need to show $\mathcal{P}_{F_+} = \mathcal{P}_F$. By the definition of $F_+$ we have
$\mathcal{P}_{F_+} = \{ s \in \mathbb{R}^d_+ \mid s(A) \leq F(A),\ A \in \mathcal{D}_F \}$ but the previous lemma precisely states that the last set is
equal to $\mathcal{P}_F$.

We now argue that, for all $A \in \mathcal{D}_F$, $F_-(A) = F(A) = F_+(A)$. Indeed, the equality $F(A) = F_+(A)$
holds by definition, and, for all $A \in \mathcal{D}_F$, we need to have $F(A) = F_-(A)$ because
$F_-(A) = \max_{s \in \mathcal{P}_F} s(A) = \max_{s \in \mathcal{P}_{F_+}} s(A)$ and if we had $F_-(A) < F(A)$, this would imply that A is redundant.

□
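
As a small worked illustration of the three functions (the specific values are chosen here for the example and are not from the paper), take $V = \{1, 2\}$ with $F(\{1\}) = F(\{2\}) = 1$ and $F(\{1,2\}) = 3$. The constraint attached to $\{1,2\}$ is implied by the two singleton constraints, so $\{1,2\}$ is redundant and the core set reduces to the singletons:

```latex
\begin{aligned}
\mathcal{P}_F &= \{ s \in \mathbb{R}^2_+ \mid s_1 \leq 1,\ s_2 \leq 1,\ s_1 + s_2 \leq 3 \} = [0,1]^2,
  \qquad \mathcal{D}_F = \{\{1\},\{2\}\}, \\
F_-(\{1,2\}) &= \max_{s \in [0,1]^2} (s_1 + s_2) = 2 \;<\; F(\{1,2\}) = 3 \;<\; F_+(\{1,2\}) = +\infty, \\
F_-(\{i\}) &= F(\{i\}) = F_+(\{i\}) = 1 \quad \text{for } i \in \{1,2\}.
\end{aligned}
```

All three functions define the same polyhedron $[0,1]^2$, and they agree on the core set, as stated in Proposition 3. This is the situation of panel (iii) in Figure 2, where the resulting norm is a weighted $\ell_1$-norm.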

Finally, the term "upper combinatorial envelope" is motivated by the following lemma: 

Lemma 5. $F_+$ is the pointwise supremum of all the set-functions H that are upper bounds on F
and such that $\mathcal{P}_H = \mathcal{P}_F$.

Proof. We need to show that we have $F_+ : A \mapsto \sup\{ H(A) \mid H \in \mathcal{E}_F \}$ with

$$\mathcal{E}_F := \{ H : 2^V \to \overline{\mathbb{R}}_+ \ \text{s.t.}\ H \geq F \ \text{and}\ \mathcal{P}_H = \mathcal{P}_F \}.$$

But for any $H \in \mathcal{E}_F$, $\mathcal{P}_H = \mathcal{P}_F$ implies that $H_- = F_-$ by definition of the lower combinatorial
envelope. Moreover we have $\mathcal{D}_H \subset \mathcal{D}_F$ since if $A \notin \mathcal{D}_F$, then A is redundant for F, i.e., there
are $A_1, \ldots, A_k \in \mathcal{D}_F$ such that $(\forall i,\ s(A_i) \leq F(A_i)) \Rightarrow (s(A) \leq F(A))$ but $F(A) \leq H(A)$, and
thus, since $F(A_i) = H_-(A_i)$, this implies that A is redundant for H, i.e., $A \notin \mathcal{D}_H$. Now, assume
there is $A \in \mathcal{D}_F \cap \mathcal{D}_H^c$. Since A is redundant for H then there are $A_1, \ldots, A_k$ in $\mathcal{D}_H \setminus \{A\}$ such that
$(\forall i,\ s(A_i) \leq H(A_i)) \Rightarrow (s(A) \leq H(A))$, but since $A_i \in \mathcal{D}_H \subset \mathcal{D}_F$ we have $H(A_i) = H_-(A_i) = F_-(A_i)$
and since $A \in \mathcal{D}_F$ we also have $F(A) = F_-(A)$ so either
$(\forall A' \in \mathcal{D}_H,\ s(A') \leq F_-(A')) \Rightarrow (s(A) \leq F_-(A))$ so that A is redundant, which is excluded, or since
$H_-(A) = \max_{s \in \mathcal{P}_H} s(A) = \max_{s:\ s(A') \leq F_-(A'),\ A' \in \mathcal{D}_H} s(A)$, we then have $H_-(A) > F_-(A)$, which is also
impossible. So we necessarily have $\mathcal{D}_H = \mathcal{D}_F$. To conclude the proof we just need to show that
$F_+ \in \mathcal{E}_F$ and that $F_+ \geq H$ for all $H \in \mathcal{E}_F$; this inequality is trivially satisfied for $A \notin \mathcal{D}_{F_+}$, and since
$\mathcal{D}_H = \mathcal{D}_{F_+}$, for $A \in \mathcal{D}_{F_+}$, we have $F_+(A) = F_-(A) = H_-(A) = H(A)$. □

The picture that emerges at this point from the results shown is rather simple: any combinatorial
function F defines a polyhedron $\mathcal{P}_F$ whose faces of dimension d − 1 are indexed by a set $\mathcal{D}_F \subset 2^V$
that we called the core set. In symbolic notation: $\mathcal{P}_F = \{ s \in \mathbb{R}^d_+ \mid s(A) \leq F(A),\ A \in \mathcal{D}_F \}$. All the
combinatorial functions which are equal to F on $\mathcal{D}_F$ and which otherwise take values that are larger
than its lower combinatorial envelope $F_-$ have the same $\ell_p$ tightest positively homogeneous convex
relaxation $\Omega_p^F$, the smallest such function being $F_-$ and the largest $F_+$. Moreover $F_-(A) = \Omega_\infty^F(1_A)$,
so that $\Omega_\infty^F$ is an extension of $F_-$. By construction, and even if F is not a non-decreasing function, $F_-$
is non-decreasing, while $F_+$ is obviously not a non-decreasing function, even though its restriction to
$\mathcal{D}_F$ is. It might therefore seem an odd set-function to consider; however, if $\mathcal{D}_F$ is a small set, since
$\Omega_p^F = \Omega_p^{F_+}$, it provides a potentially much more compact representation of the norm, which we
now relate to a norm previously introduced in the literature.

3 Latent group Lasso, block-coding and set-cover penalties 

The norm $\Omega_p^F$ is actually not a new norm. It was introduced from a different point of view by Jacob
et al. (2009) (see also Obozinski et al., 2011) as one of the possible generalizations of the group Lasso
to the case where groups overlap.

To establish the connection, we now provide a more explicit form for $\Omega_p^F$, which is different from the
definition via its dual norm which we have exploited so far.

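We consider models that are parameterized by a vector $w \in \mathbb{R}^V$ and associate to them latent variables
that are tuples of vectors of $\mathbb{R}^V$ indexed by the power-set of V. Precisely, with the notation

$$\mathcal{V} = \big\{ v = (v^A)_{A \subset V} \in (\mathbb{R}^V)^{2^V} \ \text{s.t.}\ \mathrm{Supp}(v^A) \subset A \big\},$$

we define the norms $\Omega_p$ as

$$\Omega_p(w) = \min_{v \in \mathcal{V}} \sum_{A \subset V} F(A)^{1/q} \|v^A\|_p \quad \text{s.t.} \quad w = \sum_{A \subset V} v^A. \tag{3}$$

As suggested by notations and as first proved for p = 2 by Jacob et al. (2009), we have:

Lemma 6. $\Omega_p$ and $\Omega_p^*$ are dual to each other.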

An elementary proof of this result is provided in Obozinski et al. (2011)⁴. We propose a slightly
more abstract proof of this result in Appendix A using explicitly the fact that $\Omega_p$ is defined as an
infimal convolution.

We will refer to this norm $\Omega_p$ as the latent group Lasso since it is defined by introducing latent
variables $v^A$ that are themselves regularized instead of the original model parameters. We refer the
reader to Obozinski et al. (2011) for a detailed presentation of this norm, some of its properties and
some support recovery results in terms of the support of the latent variables. In Jacob et al. (2009)
the expansion (3) did not involve all terms of the power-set but only a subcollection of sets $\mathcal{G} \subset 2^V$.
The notion of redundant set discussed in Section 2.3 was actually introduced by Obozinski et al.
(2011, Sec. 8.1) and the set $\mathcal{G}$ could be viewed as the core set $\mathcal{D}_F$. A result of Obozinski et al. (2011)
for p = 2 generalizes immediately to other p: the unit ball of $\Omega_p$ can be shown to be the convex hull
of the sets $D_A = \{ w \in \mathbb{R}^d \mid \|w_A\|_p \leq F(A)^{-1/q} \}$. This is illustrated in Figure 3.

The motivation of Jacob et al. (2009) was to find a convex regularization which would induce sparsity
patterns that are unions of groups in $\mathcal{G}$ and explain the estimated vector w as a combination of a
small number of latent components, each supported on one group of $\mathcal{G}$. The motivation is very
similar in Huang et al. (2011) who consider an $\ell_0$-type penalty they call block coding, where each
support is penalized by the minimal sum of the coding complexities of a certain number of elementary
sets called "blocks" which cover the support. In both cases the underlying combinatorial penalty is
the minimal weighted set cover defined for a set $B \subset V$ by:

$$\tilde F(B) = \min_{(\delta^A)_{A \subset V}} \sum_{A \subset V} F(A)\, \delta^A \quad \text{s.t.} \quad \sum_{A \subset V} \delta^A 1_A \geq 1_B, \quad \delta^A \in \{0, 1\},\ A \subset V.$$

While the norm proposed by Jacob et al. (2009) can be viewed as a form of "relaxation" of the
set-cover problem, a rigorous link between the $\ell_0$ and convex formulations is missing. We will make
this statement rigorous through a new interpretation of the lower combinatorial envelope of F.

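Indeed, assume w.l.o.g. that $w \in \mathbb{R}^d_+$. For $x, y \in \mathbb{R}^V$, we write $x \geq y$ if $x_i \geq y_i$ for all $i \in V$. Then,

$$\begin{aligned}
\Omega_\infty(w) &= \min_{v \in \mathcal{V}} \sum_{A \subset V} F(A) \|v^A\|_\infty \quad \text{s.t.} \quad \sum_{A \subset V} v^A \geq w \\
&= \min_{\delta^A \in \mathbb{R}_+} \sum_{A \subset V} F(A)\, \delta^A \quad \text{s.t.} \quad \sum_{A \subset V} \delta^A 1_A \geq w,
\end{aligned}$$

since if $(v^A)_{A \subset V}$ is a solution so is $(\delta^A 1_A)_{A \subset V}$ with $\delta^A = \|v^A\|_\infty$. We then have

$$F_-(B) = \min_{(\delta^A)_{A \subset V}} \sum_{A \subset V} F(A)\, \delta^A \quad \text{s.t.} \quad \sum_{A \subset V} \delta^A 1_A \geq 1_B, \quad \delta^A \in [0, 1],\ A \subset V, \tag{4}$$

because constraining $\delta$ to the unit cube does not change the optimal solution, given that $1_B \leq 1$.
But the optimization problem in (4) is exactly the fractional weighted set-cover problem (Lovász,
1975), a classical relaxation of the weighted set-cover problem defined above.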

Combining Proposition 2 with the fact that $F_-(A)$ is the fractional weighted set-cover now yields:

Theorem 5. $\Omega_p^F(w)$ is the tightest convex relaxation of the function $w \mapsto \|w\|_p \tilde F(\mathrm{Supp}(w))^{1/q}$ where
$\tilde F(\mathrm{Supp}(w))$ is the weighted set-cover of the support of w.

Proof. We have $F_-(A) \leq \tilde F(A) \leq F(A)$ so that, since $F_-$ is the lower combinatorial envelope of F,
it is also the lower combinatorial envelope of $\tilde F$, and therefore $\Omega_p^{F_-} = \Omega_p^{\tilde F} = \Omega_p^F$. □

This proves that the norm $\Omega_p^F$ proposed by Jacob et al. (2009) is indeed in a rigorous sense a
relaxation of the block-coding or set-cover penalty.

⁴ The proof in Obozinski et al. (2011) addresses the p = 2 case but generalizes immediately to other values of p.

Figure 3: Unit balls in $\mathbb{R}^2$ for four combinatorial functions (actually all submodular) on two variables.
Top left and middle row: p = ∞; top right and bottom row: p = 2. Changing values of F may make
some of the extreme points disappear. All norms are hulls of a disk and points along the axes, whose
size and position is determined by the values taken by F. On top row: $F(A) = F_-(A) = |A|^{1/2}$ (all
possible extreme points); and from left to right on the middle and bottom rows: $F(A) = |A|$ (leading
to $\|\cdot\|_1$), $F(A) = F_-(A) = \min\{|A|, 1\}$ (leading to $\|\cdot\|_p$), $F(A) = F_-(A) = \frac{1}{2} 1_{\{A \cap \{2\} \neq \emptyset\}} + 1_{\{A \neq \emptyset\}}$.

Example 2. To illustrate the above results consider the block-coding scheme for subsets of
$V = \{1, 2, 3\}$ with blocks consisting only of pairs, i.e., chosen from the collection
$\mathcal{D} := \{\{1,2\}, \{2,3\}, \{1,3\}\}$ with costs all equal to 1. The following table lists the values of
$F$, $\tilde F$ and $F_-$:

              ∅     {1}    {2}    {3}    {1,2}   {2,3}   {1,3}   {1,2,3}
  $F$         0     ∞      ∞      ∞      1       1       1       ∞
  $\tilde F$  0     1      1      1      1       1       1       2
  $F_-$       0     1      1      1      1       1       1       3/2

Here, F is equal to its UCE (except that $F_+(\emptyset) = \infty$) and takes therefore non trivial values only on
the core set $\mathcal{D}_F = \mathcal{D}$. All non-empty sets except V can be covered by exactly one set, which explains
the cases where $F_-$ and $\tilde F$ take the value one. $\tilde F(V) = 2$ since V is covered by any pair of blocks
and a slight improvement is obtained if fractional covers are allowed since for $\delta_1 = \delta_2 = \delta_3 = \frac{1}{2}$, we
have $1_V = \delta_1 1_{\{2,3\}} + \delta_2 1_{\{3,1\}} + \delta_3 1_{\{1,2\}}$ and therefore $F_-(V) = \delta_1 + \delta_2 + \delta_3 = \frac{3}{2}$.
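
The values in Example 2 can be verified with a few lines of exact arithmetic. The sketch below is illustrative only (the helper `integer_cover_cost` and the variable names are introduced here): it computes the integer set-cover cost by brute force, then checks that the half-weight cover is feasible with cost 3/2 and that the point s = (1/2, 1/2, 1/2) lies in the canonical polyhedron with s(V) = 3/2. Since the maximum defining the lower combinatorial envelope can never exceed the cost of a feasible fractional cover, these two certificates together pin the optimal value at 3/2.

```python
from fractions import Fraction
from itertools import combinations

V = frozenset({1, 2, 3})
blocks = {frozenset({1, 2}): 1, frozenset({2, 3}): 1, frozenset({1, 3}): 1}

def integer_cover_cost(B, blocks):
    """Minimal total cost of a sub-collection of blocks whose union contains B."""
    names = list(blocks)
    best = None
    for k in range(len(names) + 1):
        for chosen in combinations(names, k):
            covered = frozenset().union(*chosen)
            if B <= covered:
                cost = sum(blocks[A] for A in chosen)
                best = cost if best is None else min(best, cost)
    return best

int_cost_V = integer_cover_cost(V, blocks)
int_cost_single = integer_cover_cost(frozenset({1}), blocks)

# Fractional cover with weight 1/2 on every block: feasibility and cost.
delta = {A: Fraction(1, 2) for A in blocks}
coverage = {i: sum(delta[A] for A in blocks if i in A) for i in V}
frac_cost = sum(blocks[A] * delta[A] for A in blocks)

# Matching certificate: s = (1/2, 1/2, 1/2) satisfies s(A) <= F(A) on all blocks.
s = {i: Fraction(1, 2) for i in V}
s_feasible = all(sum(s[i] for i in A) <= blocks[A] for A in blocks)
s_value = sum(s.values())
```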

The interpretation of the LCE as the value of a minimum fractional weighted set cover suggests a
new interpretation of $F_+$ (or equivalently of $\mathcal{D}_F$) as defining the smallest set of blocks ($\mathcal{D}_F$) and
their costs that induce a fractional set cover problem with the same optimal value.

It is interesting to note (but probably not a coincidence) that it is Lovász who introduced the
concept of optimal fractional weighted set cover, while we just showed that the value of that cover
is precisely $F_-$, i.e., the combinatorial function which is extended by $\Omega_\infty^F = \Omega_\infty^{F_-}$ and which, if $F_+$
is submodular, is equal to the Lovász extension.

The interpretation of $F_-$ as the value of a minimum fractional weighted set cover problem allows us
also to show a result which is dual to the property of LCEs, and which we now present.

3.1 Largest convex positively homogeneous function with same combinatorial restriction

By symmetry with the characterization of the lower combinatorial envelope as the smallest
combinatorial function that has the same tightest convex and positively homogeneous (p.h.) relaxation as
a given combinatorial function F, we can, given a convex positively homogeneous function g, define
the combinatorial function $F : A \mapsto g(1_A)$, which by construction, is the combinatorial function
which g extends (in the sense of Lovász) to $\mathbb{R}^d_+$, and ask if there exists a largest convex and p.h.
function $g_+$ among all such functions. It turns out that this problem is well-posed if the question
is restricted to functions that are also coordinate-wise non-decreasing. Perhaps not surprisingly, it
is then the case that the largest convex p.h. function extending the same induced combinatorial
function is precisely $\Omega_\infty^F$, as we show in the next lemma.

Lemma 7. (Largest convex positively homogeneous extension) Let g be a convex, p.h. and
coordinate-wise non-decreasing function defined on $\mathbb{R}^d_+$. Define F as $F : A \mapsto g(1_A)$ and denote by $F_-$ its lower
combinatorial envelope.

Then $F = F_-$ and $\forall w \in \mathbb{R}^d$, $g(|w|) \leq \Omega_\infty^F(w)$.

Proof. From Equation (4), we know that $F_-$ can be written as the value of a minimal weighted
fractional set-cover. But if $1_B \leq \sum_{A \subset V} \delta^A 1_A$, we have

$$\sum_{A \subset V} \delta^A g(1_A) \geq g\Big( \sum_{A \subset V} \delta^A 1_A \Big) \geq g(1_B),$$

where the first inequality results from the convexity and homogeneity of g, and the second from
the assumption that it is coordinate-wise non-decreasing. As a consequence, injecting the above
inequality in (4), we have $F_-(B) \geq F(B)$. But since we always have $F_- \leq F$, this proves the
equality.

For the second statement, using the coordinate-wise monotonicity of g and its homogeneity, we have
$g(|w|) \leq \|w\|_\infty\, g(1_{\mathrm{Supp}(w)}) = \|w\|_\infty F(\mathrm{Supp}(w))$. Then, taking the convex envelope of functions on
both sides of the inequality we get $g(|\cdot|)^{**} \leq (\|\cdot\|_\infty F(\mathrm{Supp}(\cdot)))^{**} = \Omega_\infty^F$, where $(\cdot)^*$ denotes the
Fenchel-Legendre transform. □

4 Examples

4.1 Overlap count functions, their relaxations and the $\ell_1/\ell_p$-norms.

A natural family of set functions to consider are the functions that, given a collection of sets
$\mathcal{G} \subset 2^V$, are defined as the (weighted) number of these sets that are intersected by the support:

$$F_\cap(A) = \sum_{B \in \mathcal{G}} d_B\, 1_{\{A \cap B \neq \emptyset\}}. \tag{5}$$
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Since A i->- 1{ahg^0} is clearly submodular and since submodular functions form a positive cone, 
all these functions are submodular, which implies that f2p n is a tight relaxation of F n . 

Overlap count functions vs set-covers. As mentioned in Section 2.1, if $\mathcal{G}$ is a partition, the norm $\Omega_p^{F_\cap}$ is the $\ell_1/\ell_p$-norm; in this special case, $F_\cap$ is actually the value of the minimal (integer-valued) weighted set-cover associated with the sets in $\mathcal{G}$ and the weights $d_G$.

However, it should be noted that, in general, the value of these functions is quite different from the value of a minimal weighted set-cover. It has rather the flavor of some sort of "maximal weighted set-cover", in the sense that any set that has a non-empty intersection with the support would be included in the cover. We call them overlap count functions.

$\ell_p$ relaxations of $F_\cap$ vs $\ell_1/\ell_p$-norms. In the case where $p = \infty$, Bach (2010) showed that even when groups overlap we have $\Omega_\infty^{F_\cap}(w) = \sum_{B \in \mathcal{G}} d_B \|w_B\|_\infty$, since the Lovász extension of a sum of submodular functions is just the sum of the Lovász extensions of the terms in the sum.

The situation is more subtle when $p < \infty$: in that case, and perhaps surprisingly, $\Omega_p^{F_\cap}$ is not the weighted $\ell_1/\ell_p$ norm with overlap (Jenatton et al., 2011a), also referred to as the overlapping group Lasso (which should clearly be distinguished from the latent group Lasso), which is the norm defined by $w \mapsto \sum_{B \in \mathcal{G}} d'_B \|w_B\|_p$. The norm $\Omega_p^{F_\cap}$ does not have a simple closed form in general. In terms of the sparsity patterns induced, however, $\Omega_p^{F_\cap}$ behaves like $\Omega_\infty^{F_\cap}$, and as a result the sparsity patterns allowed by $\Omega_p^{F_\cap}$ are the same as those allowed by the corresponding weighted $\ell_1/\ell_p$ norm with overlap.

$\ell_p$-relaxation of $F_\cap$ vs latent group Lasso based on $\mathcal{G}$. It should be clear as well that $\Omega_p^{F_\cap}$ is not itself the latent group Lasso associated with the collection $\mathcal{G}$ and the weights $d_G$ in the sense of Jacob et al. (2009). Indeed, the latter corresponds to the function $F_\cup : A \mapsto 1_{\{A \neq \emptyset\}} + \iota_{\{A \in \mathcal{G}\}}$, or to its LCE, which is the minimal value of the fractional weighted set cover associated with $\mathcal{G}$. Clearly, $F_\cup$ is in general strictly smaller than $F_\cap$, and since the relaxation of the latter is tight, it cannot be equal to the relaxation of the former if the combinatorial functions are themselves different. Obviously, the function $\Omega_p^{F_\cap}$ is still, as shown in this paper, another latent group Lasso corresponding to a fractional weighted set cover and involving a larger number of sets than the ones in $\mathcal{G}$ (possibly all of $2^V$). This last statement leads us to what might appear to be a paradox, which we discuss next.

Supports stable by intersection vs formed as unions. Jenatton et al. (2011a) have shown that the family of norms they considered induces possible supports which form a family that is stable by intersection, in the sense that the intersection of any two possible supports is also a possible support. But since, as mentioned above, they have the same supports as the norms $\Omega_p^{F_\cap}$, for $1 < p < \infty$, which are latent group Lasso norms, and since Jacob et al. (2009) have discussed the fact that the supports induced by any norm $\Omega_p$ are formed by unions of elements of the core set $\mathcal{D}$, it might appear paradoxical that the allowed supports can be described at the same time as intersections and as unions. There is in fact no contradiction, because in general the set of supports induced by the latent group Lasso is not stable by union, in the sense that some unions are actually "unstable" and will thus not be selected.

Three different norms. To conclude, we must, given a set of groups $\mathcal{G}$ and a collection of weights $(d_G)_{G \in \mathcal{G}}$, distinguish three norms that can be defined from it: the weighted $\ell_1/\ell_p$-norm with overlap, the norm $\Omega_p^{F_\cap}$ obtained as the $\ell_p$ relaxation of the submodular penalty $F_\cap$, and finally, the norm $\Omega_p^{F_\cup}$ obtained as the relaxation of the set-cover or block-coding penalty with the weights $d_G$.

Some of the advantages of using a tight relaxation still need to be assessed empirically and theoretically, but the possibility of using an $\ell_p$-relaxation for $p < \infty$ removes the artifacts that were specific to the $\ell_\infty$ case.

4.1 Chains, trees and directed acyclic graphs. 

Instances of the three types of norms above are naturally relevant to induce sparsity patterns on structures such as chains, trees and directed acyclic graphs.

The weighted $\ell_1/\ell_p$-norm with overlap has been proposed to induce interval patterns on chains and rectangular or convex patterns on grids (Jenatton et al., 2011a), for certain sparsity patterns on trees (Jenatton et al., 2011b) and on directed acyclic graphs (Mairal et al., 2011).

One of the norms considered in Jenatton et al. (2011a) provides a nice example of an overlap count function, which is worth presenting.

Example 3 (Modified range function). As shown in Example 1 in Section 2.2, the natural range function on a sequence leads to a trivial LCE. Consider now the penalty with the form of Eq. (5) with $\mathcal{G}$ the set of groups defined as

$$\mathcal{G} = \{[1,k] \mid 1 \le k \le p\} \cup \{[k,p] \mid 1 \le k \le p\}.$$

A simple calculation shows that $F_\cap(\emptyset) = 0$ and that for $A \neq \emptyset$, $F_\cap(A) = d - 1 + \mathrm{range}(A)$. This function is submodular as a sum of submodular functions, and thus equal to its lower combinatorial envelope, which implies that the relaxation retains the structural a priori encoded by the combinatorial function itself. We will consider the $\ell_2$ relaxation of this submodular function in the experiments (see Section 7) and compare it with the $\ell_1/\ell_2$-norm with overlap of Jenatton et al. (2011a).
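The closed form stated in Example 3 can be checked by brute force on a short chain. The sketch below is only an illustration under stated assumptions: unit weights $d_B = 1$, a sequence of length $d$ (so that the $p$ appearing in the definition of the groups equals $d$), and groups stored as a set so that the interval $[1,d]$, which is both a prefix and a suffix, is counted once. All function names are hypothetical and not taken from the paper.

```python
from itertools import combinations

def interval_groups(d):
    """Prefixes [1,k] and suffixes [k,d], 1 <= k <= d, as a set of frozensets.
    Using a set means the full interval [1,d] appears only once (assumption)."""
    prefixes = {frozenset(range(1, k + 1)) for k in range(1, d + 1)}
    suffixes = {frozenset(range(k, d + 1)) for k in range(1, d + 1)}
    return prefixes | suffixes

def overlap_count(A, groups):
    """Unit-weight overlap count: number of groups having a non-empty intersection with A."""
    return sum(1 for B in groups if B & A)

def set_range(A):
    """range(A) = max(A) - min(A) + 1, and 0 for the empty set."""
    return max(A) - min(A) + 1 if A else 0

def check_modified_range(d):
    """Compare the overlap count with d - 1 + range(A) on every non-empty subset."""
    groups = interval_groups(d)
    for size in range(1, d + 1):
        for A in combinations(range(1, d + 1), size):
            A = frozenset(A)
            if overlap_count(A, groups) != d - 1 + set_range(A):
                return False
    return overlap_count(frozenset(), groups) == 0
```

Calling `check_modified_range(d)` for a small `d` enumerates all $2^d$ subsets, so it is only meant as a sanity check and not as a way to evaluate the penalty in practice.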

In the case of trees and DAGs, a natural counting function to consider is the number of nodes which have at least one descendant in the support, i.e., functions of the form $F_\cap : A \mapsto \sum_{i \in V} 1_{\{A \cap D_i \neq \emptyset\}}$, where $D_i$ is the set containing node $i$ and all its descendants. It is related to the weighted $\ell_1/\ell_p$-norms which were considered in Jenatton et al. (2011b) ($p \in \{2, \infty\}$) and Mairal and Yu (2012) ($p = \infty$). As discussed before, while these norms include $\Omega_\infty^{F_\cap}$ if $p = \infty$, they otherwise do not correspond to the tightest relaxation, which it would be interesting to consider in future work.

Beyond the standard group Lasso and the exclusive group Lasso, there are very few instances of the norm $\Omega_p^F$ appearing in the literature. One such example is the wedge penalty considered in Micchelli et al. (2011).

Latent group Lasso formulations are also of interest in these cases, and have not yet been investigated much, with the exception of Mairal and Yu (2012), which considered the case of a parameter vector with coefficients indexed by a DAG and $\mathcal{G}$ the set of all paths in the graph.

There are clearly other combinatorial functions of interest than submodular functions and set-cover 
functions. We present an example of such functions in the next section. 

4.2 Exclusive Lasso 

The exclusive Lasso is a formulation proposed by Zhou et al. (2010) which considers the case where a partition $\mathcal{G} = \{G_1, \dots, G_k\}$ of $V$ is given and the sparsity imposed is that $w$ should have at most one non-zero coefficient in each group $G_j$. The regularizer proposed by Zhou et al. (2010) is the $\ell_p/\ell_1$-norm defined$^5$ by $\|w\|_{\ell_p/\ell_1} = \big(\sum_{G \in \mathcal{G}} \|w_G\|_1^p\big)^{1/p}$. Is this the tightest relaxation?

A natural combinatorial function corresponding to the desired constraint is the function $F(A)$ defined by $F(\emptyset) = 0$, $F(A) = 1$ if $\max_{G \in \mathcal{G}} |A \cap G| = 1$ and $F(A) = \infty$ otherwise.

To characterize the corresponding $\Omega_p$ we can compute explicitly its dual norm $\Omega_p^*$:

$$\big(\Omega_p^*(s)\big)^q = \max_{A \subset V,\, A \neq \emptyset} \frac{\|s_A\|_q^q}{F(A)} = \max_{A \subset V} \|s_A\|_q^q \quad \text{s.t.} \quad |A \cap G| \le 1,\ G \in \mathcal{G}$$

$$= \max_{i_j \in G_j,\, 1 \le j \le k}\ \sum_{j=1}^{k} |s_{i_j}|^q = \sum_{j=1}^{k} \max_{i \in G_j} |s_i|^q = \sum_{j=1}^{k} \|s_{G_j}\|_\infty^q,$$

which shows that $\Omega_p^*$ is the $\ell_q/\ell_\infty$-norm or equivalently that $\Omega_p$ is the $\ell_p/\ell_1$-norm and provides a theoretical justification for the choice of this norm: it is indeed the tightest relaxation! It is interesting to compute the lower combinatorial envelope of $F$, which is $F_-(A) = \Omega_\infty(1_A) = \|1_A\|_{\ell_\infty/\ell_1} = \max_{G \in \mathcal{G}} |A \cap G|$. This last function is also a natural combinatorial function to consider; by the previous result $F_-$ has the same convex relaxation as $F$, but it would however be less obvious to show directly that $\Omega_p^{F_-}$ is the $\ell_p/\ell_1$-norm (see appendix B for a direct proof which uses Lemma 7).
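The identity between the dual norm of the exclusive Lasso relaxation and the $\ell_q/\ell_\infty$-norm can be checked numerically on a toy partition. The sketch below assumes the definition of the dual norm as a maximum over non-empty sets $A$ of $\|s_A\|_q / F(A)^{1/q}$, with $F(A) = 1$ when $A$ has at most one index per group and $F(A) = \infty$ otherwise. The function names and the 0-based indexing are illustrative choices.

```python
from itertools import combinations

def dual_norm_bruteforce(s, groups, q):
    """Maximum of ||s_A||_q over non-empty A with at most one index per group.
    Sets with two indices in the same group have F(A) = +inf and contribute 0."""
    group_of = {i: g for g, G in enumerate(groups) for i in G}
    best = 0.0
    for size in range(1, len(s) + 1):
        for A in combinations(range(len(s)), size):
            hit = [group_of[i] for i in A]
            if len(set(hit)) < len(hit):
                continue  # infeasible: some group is hit more than once
            best = max(best, sum(abs(s[i]) ** q for i in A) ** (1.0 / q))
    return best

def lq_linf_norm(s, groups, q):
    """The l_q / l_inf norm: l_q norm of the vector of per-group max absolute values."""
    return sum(max(abs(s[i]) for i in G) ** q for G in groups) ** (1.0 / q)
```

The brute-force enumeration is exponential in the dimension and is only intended to confirm the closed form on small instances.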



5 A variational form of the norm 

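Several results on $\Omega_p$ rely on the fact that it can be related variationally to $\Omega_\infty$.
Lemma 8. $\Omega_p$ admits the two following variational formulations:

$$\Omega_p(w) = \max_{\kappa \in \mathbb{R}^d_+} \sum_{i \in V} \kappa_i^{1/q} |w_i| \quad \text{s.t.} \quad \forall A \subset V,\ \kappa(A) \le F(A)$$

$$= \min_{\eta \in \mathbb{R}^d_+} \sum_{i \in V} \frac{1}{p} \frac{|w_i|^p}{\eta_i^{p-1}} + \frac{1}{q} \Omega_\infty(\eta).$$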

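Proof. Using Fenchel duality, we have:

$$\Omega_p(w) = \max_{s \in \mathbb{R}^d} s^\top w \quad \text{s.t.} \quad \Omega_p^*(s) \le 1$$

$$= \max_{s \in \mathbb{R}^d} s^\top w \quad \text{s.t.} \quad \forall A \subset V,\ \|s_A\|_q^q \le F(A) \quad \text{by definition of } \Omega_p^*$$

$$= \max_{\kappa \in \mathbb{R}^d_+} \sum_{i \in V} \kappa_i^{1/q} |w_i| \quad \text{s.t.} \quad \forall A \subset V,\ \kappa(A) \le F(A).$$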



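$^5$The exclusive Lasso norm, which is $\ell_p/\ell_1$, should not be confused with the group Lasso norm, which is $\ell_1/\ell_p$.

But it is easy to verify that $\kappa_i^{1/q} |w_i| = \min_{\eta_i \in \mathbb{R}_+} \frac{1}{p} \frac{|w_i|^p}{\eta_i^{p-1}} + \frac{1}{q} \eta_i \kappa_i$, with the minimum attained for $\eta_i = |w_i|\, \kappa_i^{-1/p}$.
We therefore get:

$$\Omega_p(w) = \max_{\kappa \in \mathbb{R}^d_+} \min_{\eta \in \mathbb{R}^d_+} \sum_{i \in V} \frac{1}{p} \frac{|w_i|^p}{\eta_i^{p-1}} + \frac{1}{q} \eta^\top \kappa \quad \text{s.t.} \quad \forall A \subset V,\ \kappa(A) \le F(A)$$

$$= \min_{\eta \in \mathbb{R}^d_+} \max_{\kappa \in \mathbb{R}^d_+} \sum_{i \in V} \frac{1}{p} \frac{|w_i|^p}{\eta_i^{p-1}} + \frac{1}{q} \eta^\top \kappa \quad \text{s.t.} \quad \forall A \subset V,\ \kappa(A) \le F(A)$$

$$= \min_{\eta \in \mathbb{R}^d_+} \sum_{i \in V} \frac{1}{p} \frac{|w_i|^p}{\eta_i^{p-1}} + \frac{1}{q} \Omega_\infty(\eta),$$

where we could exchange minimization and maximization since the function is convex-concave in $\eta$ and $\kappa$, and where we eliminated formally $\kappa$ by introducing the value of the dual norm $\Omega_\infty(\eta) = \max_{\kappa \in \mathcal{P}_F} \kappa^\top \eta$. □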



Since $\Omega_\infty$ is convex, the last formulation is actually jointly convex in $(w, \eta)$ since $(x, z) \mapsto \frac{1}{p} \frac{\|x\|_p^p}{z^{p-1}} + \frac{1}{q} z$ is convex, as the perspective function of $t \mapsto t^p$ (see Boyd and Vandenberghe, 2004).

It should be noted that the norms $\Omega_p$ therefore belong to the broad family of H-norms as defined$^6$ in Bach et al. (2012, Sec. 1.4.2.) and studied by Micchelli et al. (2011).

The above result is particularly interesting if $F$ is submodular since $\Omega_\infty$ is then equal to the Lovász extension of $F$ on the positive orthant (Bach, 2010). In this case in particular, it is possible, as we will see in the next section, to propose efficient algorithms to compute $\Omega_p$ and $\Omega_p^*$, the associated proximal operators, and algorithms to solve learning problems regularized with $\Omega_p$ thanks to the above variational form.

For submodular functions, these variational forms are also the basis for the local decomposability 
result of Section 6.4 which is key to establish support recovery in Section 6.5. 
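The scalar identity at the heart of the proof of Lemma 8, namely $\kappa^{1/q}|w| = \min_{\eta > 0} \frac{1}{p}\frac{|w|^p}{\eta^{p-1}} + \frac{1}{q}\eta\kappa$ with minimizer $\eta = |w|\kappa^{-1/p}$, can be checked numerically. The following is a minimal sketch for a single coordinate with $w \neq 0$ and $\kappa > 0$; the function names and the choice of test points around the minimizer are assumptions made for the illustration.

```python
def eta_objective(w, kappa, eta, p):
    """Value of |w|^p / (p * eta^(p-1)) + eta * kappa / q, with 1/p + 1/q = 1."""
    q = p / (p - 1.0)
    return abs(w) ** p / (p * eta ** (p - 1.0)) + eta * kappa / q

def check_identity(w, kappa, p):
    """Check that the closed-form minimizer attains kappa^(1/q) * |w| and that
    a few perturbed values of eta do not give a smaller objective."""
    q = p / (p - 1.0)
    target = kappa ** (1.0 / q) * abs(w)
    eta_star = abs(w) * kappa ** (-1.0 / p)
    at_min = eta_objective(w, kappa, eta_star, p)
    nearby = [eta_objective(w, kappa, eta_star * t, p) for t in (0.5, 0.9, 1.1, 2.0)]
    return abs(at_min - target) < 1e-9 and all(v >= at_min - 1e-12 for v in nearby)
```

Since the objective is convex in $\eta$ on the positive half-line, comparing against a handful of perturbed points is enough to make the check meaningful without running a full optimizer.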



6 The case of submodular penalties 

In this section, we focus on the case where the combinatorial function F is submodular. 

Specifically, we will consider a function $F$ defined on the power set $2^V$ of $V = \{1, \dots, d\}$, which is nondecreasing and submodular, meaning that it satisfies respectively

$$\forall A, B \subset V, \quad A \subset B \Rightarrow F(A) \le F(B),$$

$$\forall A, B \subset V, \quad F(A) + F(B) \ge F(A \cup B) + F(A \cap B).$$

Moreover, we assume that $F(\emptyset) = 0$. These set-functions are often referred to as polymatroid set-functions (Fujishige, 2005; Edmonds, 2003). Also, without loss of generality, we assume that $F$ is strictly positive on singletons, i.e., for all $k \in V$, $F(\{k\}) > 0$. Indeed, if $F(\{k\}) = 0$, then by submodularity and monotonicity, if $A \ni k$, $F(A) = F(A \setminus \{k\})$ and thus we can simply consider $V \setminus \{k\}$ instead of $V$.

Classical examples are the cardinality function and, given a partition of $V$ into $G_1 \cup \dots \cup G_k = V$, the set-function $A \mapsto F(A)$ which is equal to the number of groups $G_1, \dots, G_k$ with non-empty intersection with $A$, which, as mentioned in Section 2.1, leads to the grouped $\ell_1/\ell_p$-norm.

$^6$Note that H-norms are in these references defined for $p = 2$ and that the variational formulation proposed here generalizes this to other values of $p \in (1, \infty)$.






With a slightly different perspective than the approach of this paper, Bach (2010) studied the special case of the norm $\Omega_p^F$ when $p = \infty$ and $F$ is submodular. As mentioned previously, he showed that in that case the norm $\Omega_\infty^F$ is the Lovász extension of the submodular function $F$, which is a well studied mathematical object.

Before presenting results on $\ell_p$ relaxations of submodular penalties, we review a certain number of relevant properties and concepts from submodular analysis. For more details, see, e.g., Fujishige (2005), and, for a review with proofs derived from classical convex analysis, see, e.g., Bach (2011).

6.1 Review of submodular function theory 

Lovász extension. Given any set-function $F$, one can define its Lovász extension $f : \mathbb{R}^d_+ \to \mathbb{R}$ as follows: given $w \in \mathbb{R}^d_+$, we can order the components of $w$ in decreasing order $w_{j_1} \ge \dots \ge w_{j_d} \ge 0$; the value $f(w)$ is then defined as

$$f(w) = \sum_{k=1}^{d-1} (w_{j_k} - w_{j_{k+1}})\, F(\{j_1, \dots, j_k\}) + w_{j_d}\, F(\{j_1, \dots, j_d\}) \qquad (6)$$

$$= \sum_{k=1}^{d} w_{j_k} \big[ F(\{j_1, \dots, j_k\}) - F(\{j_1, \dots, j_{k-1}\}) \big]. \qquad (7)$$

The Lovász extension $f$ is always piecewise-linear, and when $F$ is submodular, it is also convex (see, e.g., Fujishige (2005); Bach (2011)). Moreover, for all $\delta \in \{0,1\}^d$, $f(\delta) = F(\mathrm{Supp}(\delta))$ and $f$ is in that sense an extension of $F$ from vectors in $\{0,1\}^d$ (which can be identified with indicator vectors of sets) to all vectors in $\mathbb{R}^d_+$. Moreover, it turns out that minimizing $F$ over subsets, i.e., minimizing $f$ over $\{0,1\}^d$, is equivalent to minimizing $f$ over $[0,1]^d$ (Edmonds, 2003).

Submodular polyhedron and norm. We denote by $\mathcal{P}$ the submodular polyhedron (Fujishige, 2005), defined as the set of $s \in \mathbb{R}^d_+$ such that for all $A \subset V$, $s(A) \le F(A)$, i.e., $\mathcal{P} = \{s \in \mathbb{R}^d_+,\ \forall A \subset V,\ s(A) \le F(A)\}$, where we use the notation $s(A) = \sum_{k \in A} s_k$. With our previous definitions, the submodular polyhedron is just the canonical polyhedron associated with a submodular function. One important result in submodular analysis is that, if $F$ is a nondecreasing submodular function, then we have a representation of $f$ as a maximum of linear functions (Fujishige, 2005; Bach, 2011), i.e., for all $w \in \mathbb{R}^d_+$,

$$f(w) = \max_{s \in \mathcal{P}} w^\top s. \qquad (8)$$

We recognize here that the Lovász extension of a submodular function $F$ is directly related to the norm $\Omega_\infty^F$ in that $f(|w|) = \Omega_\infty^F(w)$ for all $w \in \mathbb{R}^d$.

Greedy algorithm. Instead of solving a linear program with $d + 2^d$ constraints, a solution $s$ to (8) may be obtained by the following algorithm (a.k.a. "greedy algorithm"): order the components of $w$ in decreasing order $w_{j_1} \ge \dots \ge w_{j_d}$, and then take for all $k \in V$, $s_{j_k} = F(\{j_1, \dots, j_k\}) - F(\{j_1, \dots, j_{k-1}\})$. Moreover, if $w \in \mathbb{R}^d$ has some negative components, then, to obtain a solution to $\max_{s \in \mathcal{P}} w^\top s$, we can take $s_{j_k}$ to be simply equal to zero for all $k$ such that $w_{j_k}$ is negative (Edmonds, 2003).
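The greedy algorithm and the resulting evaluation of the Lovász extension can be sketched in a few lines. The code below is an illustration for non-negative $w$ only, with the set-function passed as a Python callable on frozensets and indices starting at 0; these conventions, the function names and the group-count example are assumptions made for the sketch.

```python
def greedy_maximizer(w, F):
    """Greedy vector s for non-negative w: sort indices by decreasing w and
    assign to each index the marginal gain of F along the sorted prefix sets."""
    order = sorted(range(len(w)), key=lambda j: -w[j])
    s = [0.0] * len(w)
    prefix = frozenset()
    prev = F(prefix)
    for j in order:
        prefix = prefix | {j}
        val = F(prefix)
        s[j] = val - prev
        prev = val
    return s

def lovasz_extension(w, F):
    """f(w) = w^T s with s the greedy vector, i.e. the second expression of f(w)."""
    s = greedy_maximizer(w, F)
    return sum(wj * sj for wj, sj in zip(w, s))

def group_count(groups):
    """Set-function counting the groups that intersect A (submodular, nondecreasing)."""
    return lambda A: sum(1 for G in groups if A & frozenset(G))
```

For the group-count function on a partition, the value returned by `lovasz_extension` is the sum over groups of the largest entry of `w` in the group, which is the grouped $\ell_1/\ell_\infty$ form expected from the text.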

Contraction and restriction of a submodular function. Given a submodular function $F$ and a set $J$, two related functions, which are submodular as well, will play a crucial role both algorithmically and for the theoretical analysis of the norm. Those are the restriction of $F$ to a set $J$, denoted $F_J$, and the contraction of $F$ on $J$, denoted $F^J$. They are defined respectively as

$$F_J : A \mapsto F(A \cap J) \quad \text{and} \quad F^J : A \mapsto F(A \cup J) - F(J).$$

Both $F_J$ and $F^J$ are submodular if $F$ is.

In particular the norms $\Omega_p^{F_J} : \mathbb{R}^J \to \mathbb{R}_+$ and $\Omega_p^{F^J} : \mathbb{R}^{J^c} \to \mathbb{R}_+$ associated respectively with $F_J$ and $F^J$ will be useful to "decompose" $\Omega_p^F$ in the sequel. We will denote these two norms by $\Omega_J$ and $\Omega^J$ for short. Note that their domains are not $\mathbb{R}^d$ but the vectors with support in $J$ and $J^c$ respectively.

Stable sets. Another concept which will be key in this section is that of stable set. A set $A$ is said to be stable if it cannot be augmented without increasing $F$, i.e., if for all sets $B \supset A$, $B \neq A \Rightarrow F(B) > F(A)$. If $F$ is strictly increasing (such as for the cardinality), then all sets are stable. The set of stable sets is closed by intersection. In the case $p = \infty$, Bach (2011) has shown that these stable sets were the only allowed sparsity patterns.

Separable sets. A set $A$ is separable if we can find a partition of $A$ into $A = B_1 \cup \dots \cup B_k$ such that $F(A) = F(B_1) + \dots + F(B_k)$. A set $A$ is inseparable if it is not separable. As shown in Edmonds (2003), the submodular polytope $\mathcal{P}$ has full dimension $d$ as soon as $F$ is strictly positive on all singletons, and its faces are exactly the sets $\{s(A) = F(A)\}$ for stable and inseparable sets $A$. With the terminology that we introduced in Section 2.3, this means that the core set of $F$ is the set $\mathcal{D}_F$ of its stable and inseparable sets. In other words, we have $\mathcal{P} = \{s \in \mathbb{R}^d_+,\ \forall A \in \mathcal{D}_F,\ s(A) \le F(A)\}$. The core set will clearly play a role when deriving concentration inequalities in Section 6.5. For the cardinality function, stable and inseparable sets are singletons.
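For small ground sets, the stable and inseparable sets can be enumerated directly from the definitions. The sketch below relies on two simplifications that hold for nondecreasing submodular $F$ with $F(\emptyset) = 0$ and that are assumptions of the illustration rather than statements from the paper: stability is tested by adding one element at a time, and separability is tested on partitions into two parts only (by subadditivity, a partition into more parts that achieves equality yields a two-part one). Function names are hypothetical and values of $F$ are compared exactly, so integer-valued functions are assumed.

```python
from itertools import combinations

def subsets(items, min_size=0):
    """All subsets of `items` with at least `min_size` elements, as frozensets."""
    items = list(items)
    for size in range(min_size, len(items) + 1):
        for c in combinations(items, size):
            yield frozenset(c)

def is_stable(A, F, V):
    """A is stable if adding any single element strictly increases F (F nondecreasing)."""
    return all(F(A | {k}) > F(A) for k in V - A)

def is_separable(A, F):
    """A is separable if F(A) = F(B) + F(A \\ B) for some non-empty proper subset B."""
    return any(F(B) + F(A - B) == F(A) for B in subsets(A, 1) if B != A)

def core_set(F, d):
    """Non-empty stable and inseparable subsets of {0, ..., d-1}."""
    V = frozenset(range(d))
    return {A for A in subsets(V, 1) if is_stable(A, F, V) and not is_separable(A, F)}

def group_count(groups):
    """Number of groups intersected by A."""
    return lambda A: sum(1 for G in groups if A & frozenset(G))
```

With the cardinality function the enumeration returns the singletons, as stated above, and with the group-count function of a partition it returns the groups themselves.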

6.2 Submodular function and lower combinatorial envelope 

A few comments are in order to confront submodularity with the previously introduced notions associated with cover-sets, and lower and upper combinatorial envelopes. We have shown that $F_-(A) = \Omega_\infty^F(1_A)$. But for a submodular function $\Omega_\infty^F(1_A) = f(1_A) = F(A)$ since $f$ is the Lovász extension of $F$. This shows that a submodular function is its own lower combinatorial envelope. However the converse is not true: a lower combinatorial envelope is not submodular in general. Indeed, in Example 2, we have $F_-(\{1,2\}) + F_-(\{2,3\}) < F_-(\{2\}) + F_-(\{1,2,3\})$.

The core set of a submodular function is the set $\mathcal{D}_F$ of its stable and inseparable sets, which implies that $F$ can be retrieved as the value of the minimal fractional weighted set cover with the sets $A \in \mathcal{D}_F$ and weights $F(A)$.

6.3 Optimization algorithms for the submodular case 

In the context of sparsity and structured sparsity, proximal methods have emerged as methods of choice to design efficient algorithms to minimize objectives of the form $f(w) + \lambda \Omega(w)$, where $f$ is a smooth function with Lipschitz gradients and $\Omega$ is a proper convex function (Bach et al., 2012). In a nutshell, their principle is to linearize $f$ at each iteration and to solve the problem

$$\min_{w \in \mathbb{R}^d} \nabla f(w_t)^\top (w - w_t) + \frac{L}{2} \|w - w_t\|_2^2 + \lambda \Omega(w),$$

for some constant $L$. This problem is a special case of the so-called proximal problem:

$$\min_{w \in \mathbb{R}^d} \frac{1}{2} \|w - z\|_2^2 + \lambda \Omega_p(w). \qquad (9)$$

The function mapping $z$ to the solution of the above problem is called the proximal operator. If this proximal operator can be computed efficiently, then proximal algorithms provide good rates of convergence, especially for strongly convex objectives. We show in this section that the structure of submodular functions can be leveraged to compute efficiently $\Omega_p$, $\Omega_p^*$ and the proximal operator.
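The proximal iteration described above can be sketched generically: each step minimizes the linearized objective, which amounts to applying the proximal operator of $(\lambda/L)\,\Omega$ to a gradient step. In the sketch below the proximal operator is passed as a callable; soft-thresholding, the proximal operator of the $\ell_1$-norm (the relaxation of the cardinality function), is used as a stand-in because it has a closed form. It is not the proximal operator of a general $\Omega_p$, and all names are illustrative.

```python
def soft_threshold(z, t):
    """Proximal operator of t * ||.||_1: shrink each entry towards zero by t."""
    return [max(abs(zi) - t, 0.0) * (1.0 if zi > 0 else -1.0) for zi in z]

def proximal_gradient(grad_f, prox, w0, L, lam, n_iter=200):
    """Iterate w <- prox(w - grad_f(w) / L, lam / L), i.e. solve at each step
    min_w grad_f(w_t)^T (w - w_t) + (L/2) ||w - w_t||^2 + lam * Omega(w)."""
    w = list(w0)
    for _ in range(n_iter):
        g = grad_f(w)
        z = [wi - gi / L for wi, gi in zip(w, g)]
        w = prox(z, lam / L)
    return w
```

As a usage example, with $f(w) = \frac{1}{2}\|w - b\|_2^2$, $L = 1$ and the $\ell_1$ stand-in, the iteration reaches the soft-thresholded vector of $b$ after one step and stays there.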



6.3.1 Computation of $\Omega_p$ and $\Omega_p^*$.

A simple approach to compute the norm is to maximize in $\kappa$ in the variational formulation (6). This can be done efficiently using for example a conditional gradient algorithm, given that maximizing a linear form over the submodular polyhedron is done easily with the greedy algorithm (see Section 6.1).

We will propose another algorithm to compute the norm based on the so-called decomposition algorithm, which is a classical algorithm of the submodular analysis literature that makes it possible to minimize a separable convex function over the submodular polytope efficiently (see, e.g., Bach, 2011, Section 8.6).

Since the dual norm is defined as $\Omega_p^*(s) = \max_{A \subset V,\, A \neq \emptyset} \frac{\|s_A\|_q}{F(A)^{1/q}}$, to compute it from $s$, we need to maximize efficiently over $A$, which can be done, for submodular functions, through a sequence of submodular function minimizations (see, e.g., Bach, 2011, Section 8.4).



6.3.2 Computation of the proximal operator 

Using Eq. (6), we can reformulate problem (9) as

$$\min_{w \in \mathbb{R}^d} \frac{1}{2} \|w - z\|_2^2 + \lambda \Omega_p(w) = \min_{w \in \mathbb{R}^d} \max_{\kappa \in \mathbb{R}^d_+ \cap \mathcal{P}} \frac{1}{2} \|w - z\|_2^2 + \lambda \sum_{i \in V} \kappa_i^{1/q} |w_i|$$

$$= \max_{\kappa \in \mathbb{R}^d_+ \cap \mathcal{P}} \sum_{i \in V} \min_{w_i \in \mathbb{R}} \Big\{ \frac{1}{2} (w_i - z_i)^2 + \lambda \kappa_i^{1/q} |w_i| \Big\}$$

$$= \max_{\kappa \in \mathbb{R}^d_+ \cap \mathcal{P}} \sum_{i \in V} \psi_i(\kappa_i),$$

with $\psi_i : \kappa_i \mapsto \min_{w_i \in \mathbb{R}} \big\{ \frac{1}{2} (w_i - z_i)^2 + \lambda \kappa_i^{1/q} |w_i| \big\}$.

Thus, solving the proximal problem is equivalent to maximizing a concave separable function $\sum_{i \in V} \psi_i(\kappa_i)$ over the submodular polytope. For a submodular function, this can be solved with a "divide and conquer" strategy which takes the form of the so-called decomposition algorithm involving a sequence of submodular function minimizations (see Groenevelt, 1991; Bach, 2011). This yields an algorithm which finds a decomposition of the norm and applies recursively the proximal algorithm to the two parts of the decomposition corresponding respectively to a restriction and a contraction of the submodular function. We make this algorithm explicit as Algorithm 1 for the case $p = 2$.

Applying this decomposition algorithm in the special case where A = yields a decomposition 
algorithm, namely Algorithm E.2, to compute the norm itself (see appendix E.2). 
Algorithm 1 Computation of $x = \mathrm{Prox}_{\lambda \Omega_2^F}(z)$

Require: $z \in \mathbb{R}^d$, $\lambda > 0$

1: Let $A = \{j \mid z_j \neq 0\}$

2: if $A \neq V$ then

3: Set $x_A = \mathrm{Prox}_{\lambda \Omega_2^{F_A}}(z_A)$

4: Set $x_{A^c} = 0$

5: return $x$ by concatenating $x_A$ and $x_{A^c}$

6: end if

7: Let $t \in \mathbb{R}^d$ with $t_i = \frac{z_i^2}{\|z\|_2^2} F(V)$

8: Find $A$ minimizing the submodular function $F - t$

9: if $A = V$ then

10: return $x = \big(\|z\|_2 - \lambda \sqrt{F(V)}\big)_+ \frac{z}{\|z\|_2}$

11: end if

12: Let $x_A = \mathrm{Prox}_{\lambda \Omega_2^{F_A}}(z_A)$

13: Let $x_{A^c} = \mathrm{Prox}_{\lambda \Omega_2^{F^A}}(z_{A^c})$

14: return $x$ by concatenating $x_A$ and $x_{A^c}$



6.4 Weak and local decomposability of the norm for submodular functions.

The work of Negahban et al. (2010) has shown that when a norm is decomposable with respect to a pair of subspaces $A$ and $B$, meaning that for all $\alpha \in A$ and $\beta \in B^\perp$ we have $\Omega(\alpha + \beta) = \Omega(\alpha) + \Omega(\beta)$, a common proof scheme allows one to show support recovery results and fast rates of convergence in prediction error. For the norms we are considering, this type of assumption would be too strong. Instead, we follow the analysis of Bach (2010), which considered the case $p = \infty$ and which only requires some weaker form of decomposability. The decompositions involve $\Omega_J$ and $\Omega^J$, which are respectively the norms associated with the restriction and the contraction of the submodular function $F$ to or on the set $J$.

Concretely, let $c = \frac{\tilde m}{M}$ with $M = \max_{k \in V} F(\{k\})$ and

$$\tilde m = \min_{A, k}\ F(A \cup \{k\}) - F(A) \quad \text{s.t.} \quad F(A \cup \{k\}) > F(A).$$

Then we have:

Proposition 4. (Weak and local decomposability)

Weak decomposability. For any set $J$ and any $w \in \mathbb{R}^d$, we have

$$\Omega(w) \ge \Omega_J(w_J) + \Omega^J(w_{J^c}).$$

Local decomposability. Let $K = \mathrm{Supp}(w)$ and $J$ the smallest stable set containing $K$; if $\|w_{J^c}\|_p \le c^{1/p} \min_{i \in K} |w_i|$, then

$$\Omega(w) = \Omega_J(w_J) + \Omega^J(w_{J^c}).$$

Note that when $p = \infty$, if $J = K$, the condition becomes $\min_{i \in J} |w_i| \ge \max_{i \in J^c} |w_i|$, and we recover exactly the corresponding result from Bach (2010).

This proposition shows that a sort of reverse triangular inequality involving the norms $\Omega$, $\Omega_J$ and $\Omega^J$ always holds and that if there is a sufficiently large positive gap between the values of $w$ on $J$ and on its complement then $\Omega$ can be written as a separable function on $J$ and $J^c$.
6.5 Theoretical analysis for submodular functions

In this section, we consider a fixed design matrix $X \in \mathbb{R}^{n \times d}$ and $y \in \mathbb{R}^n$ a vector of random responses. Given $\lambda > 0$, we define $\hat w$ as a minimizer of the regularized least-squares cost:

$$\min_{w \in \mathbb{R}^d} \frac{1}{2n} \|y - Xw\|_2^2 + \lambda \Omega(w). \qquad (10)$$

We study the sparsity-inducing properties of solutions of (10), i.e., we determine which patterns are 
allowed and which sufficient conditions lead to correct estimation. 

We assume that the linear model is well-specified and extend results from Zhao and Yu (2006) for sufficient support recovery conditions and from Negahban et al. (2010) for estimation consistency, which were already derived by Bach (2010) for $p = \infty$. The following propositions allow us to retrieve and extend well-known results for the $\ell_1$-norm.

Denote by $\rho$ the following constant:

$$\rho = \min_{A \subset B,\ F(B) > F(A)} \frac{F(B) - F(A)}{F(B \setminus A)} \in (0, 1].$$
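For a small ground set, the constant $\rho$ can be evaluated by enumerating all pairs $A \subset B$ with $F(B) > F(A)$. The sketch below assumes a nondecreasing set-function that is strictly positive on singletons (so that $F(B \setminus A) > 0$ in every ratio considered), represented as a Python callable on frozensets; the function names and the truncated-cardinality example are illustrative.

```python
from itertools import combinations

def subsets(items, min_size=0):
    """All subsets of `items` with at least `min_size` elements, as frozensets."""
    items = list(items)
    for size in range(min_size, len(items) + 1):
        for c in combinations(items, size):
            yield frozenset(c)

def rho(F, d):
    """Minimum of (F(B) - F(A)) / F(B \\ A) over A strictly inside B with F(B) > F(A)."""
    V = frozenset(range(d))
    best = float("inf")
    for B in subsets(V, 1):
        for A in subsets(B):
            if A != B and F(B) > F(A):
                best = min(best, (F(B) - F(A)) / F(B - A))
    return best
```

For the cardinality function every ratio equals 1, so $\rho = 1$. For $F(A) = \min(|A|, 2)$ on three elements, taking $A$ a singleton and $B = V$ gives the ratio $1/2$, which is the minimum.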

The following proposition extends results based on support recovery conditions (Zhao and Yu, 2006):
Proposition 5 (Support recovery). Assume that $y = Xw^* + \sigma\varepsilon$, where $\varepsilon$ is a standard multivariate normal vector. Let $Q = \frac{1}{n} X^\top X \in \mathbb{R}^{d \times d}$. Denote by $J$ the smallest stable set containing the support $\mathrm{Supp}(w^*)$ of $w^*$. Define $\nu = \min_{j,\, w^*_j \neq 0} |w^*_j| > 0$ and assume $\kappa = \lambda_{\min}(Q_{JJ}) > 0$.

If the following generalized Irrepresentability Condition holds: 

> o, (n J y ({^j{Q-j)Qjj)) jeJ ) < i - v, 

then, if A < 2| j^/vFi j) 1 - 1 /? > ^ e m ^ m ^ zer u> is unique and has support equal to J, with probability 
larger than 1 - 3P(fT (z) > , where z is a multivariate normal with covariance matrix Q. 

In terms of prediction error the next proposition extends results based on restricted eigenvalue 
conditions (see, e.g. Negahban et al., 2010). 

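Proposition 6 (Consistency). Assume that $y = Xw^* + \sigma\varepsilon$, where $\varepsilon$ is a standard multivariate normal vector. Let $Q = \frac{1}{n} X^\top X \in \mathbb{R}^{d \times d}$. Denote by $J$ the smallest stable set containing the support $\mathrm{Supp}(w^*)$ of $w^*$.

If the following $\Omega_J$-Restricted Eigenvalue condition holds:

$$\forall \Delta \in \mathbb{R}^d, \quad \big(\Omega^J(\Delta_{J^c}) \le 3\,\Omega_J(\Delta_J)\big) \Rightarrow \big(\Delta^\top Q \Delta \ge \kappa\, \Omega_J(\Delta_J)^2\big),$$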

then we have 

tt(w-w*)^ ^ and - \\Xw - Xw* \\% < 

np z n np l 

with probability larger than 1 — ¥(il*(z) > A ^" ) where z is a multivariate normal with covariance 
matrix Q. 



The concentration of the values of $\Omega^*(z)$ for $z$ a multivariate normal with covariance matrix $Q$ can be controlled via the following result.

Proposition 7. Let $z$ be a normal variable with covariance matrix $Q$ that has unit diagonal. Let $\mathcal{D}_F$ be the set of stable inseparable sets. Then

( , Ml 1 /? 1^1(1/9-1/2)+ \ 

P(n'(*) > l^MDA) max XL^ +% max j < «"» ^. (11) 
Figure 4: Set $\mathcal{G}$ of overlapping groups defining the norm proposed by Jenatton et al. (2011a) (sets in blue or green and their complements) and an example of corresponding induced sparsity patterns (in red), respectively for interval patterns in 1D (left) and for rectangular patterns in 2D (right).



7 Experiments 
7.1 Setting 

To illustrate the results presented in this paper we consider the problem of estimating the support of a parameter vector $w \in \mathbb{R}^d$, when its support is assumed either

(i) to form an interval in $[1, d]$ or

(ii) to form a rectangle $[k_{\min}, k_{\max}] \times [k'_{\min}, k'_{\max}] \subset [1, d_1] \times [1, d_2]$, with $d = d_1 d_2$.

These two settings were considered in Jenatton et al. (2011a). These authors showed that, for both types of supports, it was possible to construct an $\ell_1/\ell_2$-norm with overlap based on a well-chosen collection of overlapping groups, so that the obtained estimators almost surely have a support of the correct form. Specifically, it was shown in Jenatton et al. (2011a) that norms of the form $w \mapsto \sum_{B \in \mathcal{G}} \|w_B\|_2$ induce sparsity patterns that are exactly intervals of $V = \{1, \dots, p\}$ if

$$\mathcal{G} = \{[1,k] \mid 1 \le k \le p\} \cup \{[k,p] \mid 1 \le k \le p\},$$

and induce rectangular supports on $V = V_1 \times V_2$ with $V_1 := \{1, \dots, p_1\}$ and $V_2 := \{1, \dots, p_2\}$ if

$$\mathcal{G} = \{[1,k] \times V_2 \mid 1 \le k \le p_1\} \cup \{[k,p_1] \times V_2 \mid 1 \le k \le p_1\}$$

$$\cup\ \{V_1 \times [1,k] \mid 1 \le k \le p_2\} \cup \{V_1 \times [k,p_2] \mid 1 \le k \le p_2\}.$$

These sets of groups are illustrated on Figure 4, and, for the first case, the set $\mathcal{G}$ has already been discussed in Example 3 to define a modified range function which is submodular.

Moreover, the authors showed that with a weighting scheme leading to a norm of the form $w \mapsto \sum_{B \in \mathcal{G}} \|w_B \circ d^B\|_2$, where $\circ$ denotes the Hadamard product and $d^B$ is a certain vector of weights designed specifically for these cases$^7$, it is possible to obtain compelling empirical results in terms of support recovery, especially in the 1D case.

Interval supports. From the point of view of our work, that is, approaching the problem in terms of combinatorial functions, for supports constrained to be intervals, it is natural to consider the range function as a possible form of penalty: $F_0(A) := \mathrm{range}(A) = i_{\max}(A) - i_{\min}(A) + 1$. Indeed the range function assigns the same penalty to sets with the same range, regardless of whether these sets are connected or have "holes"; this clearly favors intervals since they are exactly the sets with
$^7$We refer the reader to the paper for the details.
Figure 5: Examples of the shape of the signals used to define the amplitude of the coefficients of $w$ on the support. Each plot represents the value of $w_i$ as a function of $i$. The first ($w$ constant on the support), third ($w_i = g(ci)$ with $g : x \mapsto |\sin(x)\sin(5x)|$) and last signal ($w_i \sim \mathcal{N}(0, 1)$) are the ones used in reported results.

the largest support for a given value of the penalty. Unfortunately, as discussed in Example 1 of Section 2.2, the lower combinatorial envelope of the range function is $A \mapsto |A|$, the cardinality function, which implies that $\Omega_p^{F_0}$ is just the $\ell_1$-norm: in this case, the structure implicitly encoded in $F_0$ is lost through the convex relaxation.

However, as mentioned by Bach (2010) and discussed in Example 3, the function $F_r$ defined by $F_r(A) = d - 1 + \mathrm{range}(A)$ for $A \neq \emptyset$ and $F_r(\emptyset) = 0$ is submodular, which means that $\Omega_p^{F_r}$ is a tight relaxation and that regularizing with it leads to tractable convex optimization problems.

Rectangular supports. For the case of rectangles on the grid, a good candidate is the function $F_2$ with $F_2(A) = F_r(\Pi_1(A)) + F_r(\Pi_2(A))$, with $\Pi_i(A)$ the projection of the set $A$ along the $i$th axis of the grid.

This makes $\Omega_p^{F_r}$ and $\Omega_p^{F_2}$ two good candidates to estimate a vector $w$ whose support matches respectively the two described a priori.
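The two combinatorial penalties used in the experiments are simple to evaluate. The sketch below represents a 2D support as a set of `(row, column)` pairs and assumes that, when $F_r$ is applied to a projection, the ambient dimension $d$ is replaced by the size of the corresponding axis ($d_1$ or $d_2$); this reading of the definition, like the function names, is an assumption of the illustration.

```python
def F_r(A, d):
    """Modified range function: d - 1 + range(A) for non-empty A, and 0 for the empty set."""
    A = set(A)
    return d - 1 + (max(A) - min(A) + 1) if A else 0

def F_2(A, d1, d2):
    """Sum of the modified range functions of the two axis projections of A."""
    rows = {i for i, _ in A}
    cols = {j for _, j in A}
    return F_r(rows, d1) + F_r(cols, d2)
```

A filled rectangle and any subset with the same bounding box receive the same penalty, so among sets with a given penalty the rectangle has the largest support, which is the sense in which $F_2$ favors rectangular patterns.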

7.2 Methodology 

We consider a simple regression setting in which $w \in \mathbb{R}^d$ is a vector such that $\mathrm{Supp}(w)$ is either an interval on $[1, d]$ or a rectangle on a fixed 2D grid. We draw the design matrix $X \in \mathbb{R}^{n \times d}$ and a noise vector $\varepsilon \in \mathbb{R}^n$, both with i.i.d. standard Gaussian entries, and compute $y = Xw + \varepsilon$. We then solve problem (10), with $\Omega$ chosen in turn to be the $\ell_1$-norm (Lasso), the elastic net, the norms $\Omega_p^F$ for $p \in \{2, \infty\}$ and $F$ chosen to be $F_r$ or $F_2$ in 1D and 2D respectively; we consider also the overlapping $\ell_1/\ell_2$-norm proposed by Jenatton et al. (2011a) and the weighted overlapping $\ell_1/\ell_2$-norm proposed by the same authors, i.e., $\Omega(w) = \sum_{B \in \mathcal{G}} \|w_B \circ d^B\|_2$ with the same notations as before$^8$.

We assess the estimators obtained through the different regularizers both in terms of support recovery and in terms of mean-squared error in the following way: assuming that held-out data permits choosing an optimal point on the regularization path obtained with each norm, we determine along each such path the solution which either has a support with minimal Hamming distance to the true support or the solution which has the best $\ell_2$ distance, and we report the corresponding distances as a function of the sample size on Figures 6 and 7 respectively for the 1D and the 2D case.

$^8$Note that we do not need to compare with an $\ell_\infty$ counterpart of the unweighted norm considered in Jenatton et al. (2011a) since for $p = \infty$ the unweighted $\ell_1/\ell_\infty$ norm defined with the same collection $\mathcal{G}$ is exactly the norm $\Omega_\infty^{F_r}$: this follows from the form of $F_r$ as defined in Example 3 and the preceding discussion.

Finally, we assess the incidence of the fluctuation in amplitude of the coefficients in the vector $w$ generating the data: we consider different cases among which:

(i) the case where $w$ has a constant value on the support,

(ii) the case where $w_i$ varies as a modulated cosine, with $w_i = g(ci)$ for $c$ a constant scaling and $g : x \mapsto |\cos(x)\cos(5x)|$,

(iii) the case where $w_i$ is drawn i.i.d. from a standard normal distribution.

These cases (and two others for which we do not report results) are illustrated on Figure 5.

7.3 Results 

Results reported for the Hamming distances in the left columns of Figures 6 and 7 show that the norms $\Omega_2^{F_r}$ and $\Omega_2^{F_2}$ perform quite well for support recovery overall and tend to outperform significantly their $\ell_\infty$ counterpart in most cases. In 1D, several norms achieve reasonably small Hamming distance, including the $\ell_1$-norm, the norm $\Omega_2^{F_r}$ and the weighted overlapping $\ell_1/\ell_2$-norm, although the latter clearly dominates for small values of $n$.

In 2D, $\Omega_2^{F_2}$ leads clearly to smaller Hamming distances than other norms for the larger values of $n$, while it is outperformed by the $\ell_1$-norm for small sample sizes. It should be noted that neither $\Omega_\infty^{F_2}$ nor the weighted overlapping $\ell_1/\ell_2$-norm that performed so well in 1D achieve good results.

The performance of the $\ell_2$ relaxation tends to be comparatively better when the vector of parameters
$w$ has entries that vary a lot, especially when compared to the $\ell_\infty$ relaxation. Indeed, the choice
of the value of $p$ for the relaxation can be interpreted as encoding a prior on the joint distribution
of the amplitudes of the $w_i$: as discussed before, and as illustrated in Bach (2010), the unit balls
for the $\ell_\infty$ relaxations display additional "edges and corners" that lead to estimates with clustered
values of $|w_i|$, corresponding to an a priori that many entries in $w$ have identical amplitudes. More
generally, large values of $p$ correspond to the prior that the amplitudes vary little, while they vary
more significantly for small $p$.

The effect of this other type of a priori encoded in the regularization is visible when considering
the performance in terms of $\ell_2$ error. Overall, both in 1D and 2D all methods perform similarly in
$\ell_2$ error, except that when $w$ is constant on the support, the $\ell_\infty$ relaxations $\Omega_\infty^{F_r}$ and $\Omega_\infty^{F_2}$ perform
significantly better. This is most likely because the additional "corners" of these norms
induce some pooling of the estimates of the values of the $w_i$, which improves their estimation. By
contrast, when $w$ is far from constant the $\ell_\infty$ relaxations tend to have slightly
larger least-squares errors, while, on the contrary, the $\ell_1$-regularization tends to be among the better
performing methods.

8 Conclusion 

We proposed a family of convex norms defined as relaxations of penalizations that combine a
combinatorial set-function with an $\ell_p$-norm. Our formulation makes it possible to recover in a principled way classical
sparsity-inducing regularizations such as $\ell_1$, $\ell_1/\ell_p$-norms or $\ell_p/\ell_1$-norms. In addition, it establishes
that the latent group Lasso is the tightest relaxation of block-coding penalties.

There are several directions for future research. First, it would be of interest to determine for which
combinatorial functions beyond submodular ones efficient algorithms and consistency results can
be established. Then a sharper analysis of the relative performance of the estimators using different
levels of a priori would be needed to answer questions such as: When is using a structured a priori
likely to yield better estimators? When could it degrade the performance? What is the relation to
the performance of an oracle given a specified structured a priori?

Figure 6: Best Hamming distance (left column) and best least-squares error (right column) to the true
parameter vector $w^*$, among all vectors along the regularization path of a least-squares regression
regularized with a given norm, for different patterns of values of $w^*$. The different regularizers
compared include the Lasso (L1), Ridge (L2), the elastic net (EN), the unweighted (GL) and weighted
(GL+w) $\ell_1/\ell_2$ regularizations proposed by Jenatton et al. (2011a), the norms $\Omega_2^F$ (Sub p = 2) and
$\Omega_\infty^F$ (Sub p = $\infty$) for a specified function $F$. (first row) Constant signal supported on an interval,
with an a priori encoded by the combinatorial function $F : A \mapsto d - 1 + \mathrm{range}(A)$. (second row)
Same setting with a signal $w^*$ supported by an interval consisting of coefficients $w_i^*$ drawn from a
standard Gaussian distribution. In each case, the dimension is $d = 256$, the size of the true support
is $k = 160$, the noise level is $\sigma = 0.5$ and the signal amplitude is $\|w\|_\infty = 1$.

Figure 7: Best Hamming distance (left column) and best least-squares error (right column) to the true
parameter vector $w^*$, among all vectors along the regularization path of a least-squares regression regularized
with a given norm, for different patterns of values of $w^*$. The regularizations compared include the Lasso
(L1), Ridge (L2), the elastic net (EN), the unweighted (GL) and weighted (GL+w) $\ell_1/\ell_2$ regularizations
proposed by Jenatton et al. (2011a), the norms $\Omega_2^F$ (Sub p = 2) and $\Omega_\infty^F$ (Sub p = $\infty$) for a specified function
$F$. Parameter vectors $w^*$ considered here have coefficients that are supported by a rectangle on a grid with
size $d_1 \times d_2$ with $d = d_1 d_2$. (first row) Constant signal supported on a rectangle with an a priori encoded
by the combinatorial function $F : A \mapsto d_1 + d_2 - 4 + \mathrm{range}(\Pi_1(A)) + \mathrm{range}(\Pi_2(A))$. (second row) Same
setting with coefficients of $w^*$ on the support given as $w^*_{i_1 i_2} = g(c i_1) g(c i_2)$ for $c$ a positive constant and
$g : x \mapsto |\cos(x) \cos(5x)|$. (third row) Same setting with coefficients $w^*_{i_1 i_2}$ drawn from a standard Gaussian
distribution. In each case, the dimension is $d = 256$, the size of the true support is $k = 160$, the noise level
is $\sigma = 1$ and the signal amplitude is $\|w\|_\infty = 1$.

A Form of primal norm

We provide here a proof of Lemma 6, which we first recall:

Lemma (6). $\Omega_p$ and $\Omega_p^*$ are dual to each other.

Proof. Let $\omega_p^A$ be the function (footnote 9) defined by $\omega_p^A(w) = F(A)^{1/q} \|w_A\|_p + \iota_{\{v \mid \mathrm{Supp}(v) \subset A\}}(w)$, with $\iota_B$ the
indicator function taking the value $0$ on $B$ and $\infty$ on $B^c$. Let $K_p^A$ be the set $K_p^A = \{s \mid \|s_A\|_q^q \le F(A)\}$.
By construction, $\omega_p^A$ is the support function of $K_p^A$ (see Rockafellar, 1970, sec. 13), i.e.
$\omega_p^A(w) = \max_{s \in K_p^A} w^\top s$. By construction we have $\{s \mid \Omega_p^*(s) \le 1\} = \bigcap_{A \subset V} K_p^A$. But this implies
that $\iota_{\{s \mid \Omega_p^*(s) \le 1\}} = \sum_{A \subset V} \iota_{K_p^A}$. Finally, by definition of Fenchel-Legendre duality,

$$\Omega_p(w) = \max_{s \in \mathbb{R}^d} \; w^\top s - \sum_{A \subset V} \iota_{K_p^A}(s),$$

or in words $\Omega_p$ is the Fenchel-Legendre dual to the sum of the indicator functions $\iota_{K_p^A}$. But since
the Fenchel-Legendre dual of a sum of functions is the infimal convolution of the duals of these
functions (see Rockafellar, 1970, Thm. 16.4 and Corr. 16.4.1, pp. 145-146), and since by definition
of a support function $(\iota_{K_p^A})^* = \omega_p^A$, then $\Omega_p$ is the infimal convolution of the functions $\omega_p^A$, i.e.

$$\Omega_p(w) = \inf_{(v^A)_{A \subset V}} \sum_{A \subset V} \omega_p^A(v^A) \quad \text{s.t.} \quad w = \sum_{A \subset V} v^A,$$

which is equivalent to formulation (3). See Obozinski et al. (2011) for a more elementary proof of
this result. □

9 Or gauge function to be more precise.



B Example of the Exclusive Lasso 

We showed in Section 4.2 that the $\ell_p$ exclusive Lasso norm, also called $\ell_p/\ell_1$-norm, defined by the
mapping $w \mapsto \big( \sum_{G \in \mathcal{G}} \|w_G\|_1^p \big)^{1/p}$, for some partition $\mathcal{G}$, is a norm $\Omega_p^F$ providing the $\ell_p$ tightest
convex p.h. relaxation, in the sense defined in this paper, of a certain combinatorial function $F$. A
computation of the lower combinatorial envelope of that function $F$ yields the function
$F_- : A \mapsto \max_{G \in \mathcal{G}} |A \cap G|$.

This last function is also a natural combinatorial function to consider, and by the properties of an
LCE it has the same convex relaxation. It should be noted, however, that it is less obvious to show
directly that $\Omega_p^{F_-}$ is the $\ell_p/\ell_1$ norm.

We thus give a direct proof of that result, since it illustrates how the results on LCE and UCE can
be used to analyze norms and derive such results.

Lemma 9. Let $\mathcal{G} = \{G_1, \ldots, G_k\}$ be a partition of $V$. For $F : A \mapsto \max_{G \in \mathcal{G}} |A \cap G|$, we have
$\Omega_\infty^F(w) = \max_{G \in \mathcal{G}} \|w_G\|_1$.

Proof. Consider the function $f : w \mapsto \max_{G \in \mathcal{G}} \|w_G\|_1$ and the set function $\tilde F : A \mapsto f(1_A)$. We
have $\tilde F(A) = \max_{G \in \mathcal{G}} \|1_{A \cap G}\|_1 = F(A)$. But by Lemma 7, this implies that $f(w) \le \Omega_\infty^F(w)$, since
$f = f(|\cdot|)$ is convex, positively homogeneous and coordinatewise non-decreasing on $\mathbb{R}_+^d$. We could
remark first that since $F(A) = f(1_A) \le \Omega_\infty^F(1_A) \le F(A)$, this shows that $F = F_-$ is a lower
combinatorial envelope. Now note that

$$(\Omega_\infty^F)^*(s) = \max_{A \subset V, A \neq \emptyset} \; \min_{G \in \mathcal{G}} \frac{\|s_A\|_1}{|A \cap G|} \;\ge\; \max_{A \subset V, \; |A \cap G| = 1, \, G \in \mathcal{G}} \|s_A\|_1 = \sum_{G \in \mathcal{G}} \max_{i \in G} |s_i| = \sum_{G \in \mathcal{G}} \|s_G\|_\infty.$$

This shows that $(\Omega_\infty^F)^*(s) \ge \sum_{G \in \mathcal{G}} \|s_G\|_\infty$, which implies for dual norms that $\Omega_\infty^F(w) \le f(w)$.
Finally, since we showed the opposite inequality above, $\Omega_\infty^F = f$, which shows the result. □
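
The dual identity underlying Lemma 9 can be checked numerically on a small example. The sketch below is only an illustration: it evaluates $\max_{A \neq \emptyset} \|s_A\|_1 / F(A)$ by brute force over all subsets (exponential in $d$) and compares it with $\sum_G \|s_G\|_\infty$. The function names, the partition and the vector `s` are hypothetical.

```python
from itertools import combinations

def dual_norm_bruteforce(s, groups):
    """max over nonempty A of ||s_A||_1 / F(A), with F(A) = max_G |A ∩ G|."""
    d = len(s)
    best = 0.0
    for r in range(1, d + 1):
        for A in combinations(range(d), r):
            A = set(A)
            # F(A) >= 1 because the groups form a partition of the index set.
            F_A = max(len(A & G) for G in groups)
            best = max(best, sum(abs(s[i]) for i in A) / F_A)
    return best

def sum_of_group_max(s, groups):
    """Sum over groups of the sup-norm of s restricted to the group."""
    return sum(max(abs(s[i]) for i in G) for G in groups)

groups = [{0, 1}, {2, 3, 4}]           # a partition of {0, ..., 4}
s = [1.0, -3.0, 0.5, 2.0, -1.5]
lhs = dual_norm_bruteforce(s, groups)
rhs = sum_of_group_max(s, groups)
```

On this example the maximum is attained by the set picking the largest entry of each group, which is exactly the restriction used in the proof.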



C Properties of the norm when F is submodular 

In this section, we first derive upper bounds and lower bounds for our norms, as well as a local
formulation as a sum of $\ell_p$-norms on subsets of indices.

C.1 Some important inequalities.

We now derive inequalities which will be useful later in the theoretical analysis. By definition, the
dual norm satisfies the following inequalities:

$$\frac{\|s\|_\infty}{M^{1/q}} \le \max_{k \in V} \frac{|s_k|}{F(\{k\})^{1/q}} \le \Omega_p^*(s) = \max_{A \subset V, A \neq \emptyset} \frac{\|s_A\|_q}{F(A)^{1/q}} \le \frac{\|s\|_q}{\min_{A \subset V, A \neq \emptyset} F(A)^{1/q}} \le \frac{\|s\|_q}{m^{1/q}}, \quad (12)$$

for $m = \min_{k \in V} F(\{k\})$ and $M = \max_{k \in V} F(\{k\})$. These inequalities immediately imply inequalities
for $\Omega_p$ (and therefore for $f$, since for $\eta \in \mathbb{R}_+^d$, $f(\eta) = \Omega_\infty(\eta)$):

$$m^{1/q} \|w\|_p \le \Omega_p(w) \le M^{1/q} \|w\|_1.$$

We also have $\Omega_p(w) \le F(V)^{1/q} \|w\|_p$, using the following lower bound for the dual norm:
$\Omega_p^*(s) \ge \|s\|_q / F(V)^{1/q}$.

Since by submodularity we in fact have $M = \max_{A,k} F(A \cup \{k\}) - F(A)$, it makes sense to
introduce $\tilde m = \min_{A,k, \; F(A \cup \{k\}) > F(A)} F(A \cup \{k\}) - F(A) \le m$. Indeed, we consider in Section 6.5
the norm $\Omega_{p,J}$ (resp. $\Omega_p^J$) associated with restrictions of $F$ to $J$ (resp. contractions of $F$ on $J$), and
it follows from the previous inequalities that for all $J \subset V$, we have:

$$\tilde m^{1/q} \|w\|_p \le m^{1/q} \|w\|_p \le \Omega_{p,J}(w) \le M^{1/q} \|w\|_1 \quad \text{and} \quad \tilde m^{1/q} \|w\|_p \le \Omega_p^J(w) \le M^{1/q} \|w\|_1.$$
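
As a sanity check, the sandwich (12) on the dual norm can be verified by brute force for a small submodular function. The sketch below assumes a particular non-decreasing submodular $F$ (the square root of a positive modular function) and arbitrary values of `p` and `s`; these choices, like the function name `dual_norm`, are illustrative only.

```python
from itertools import combinations
import math

def dual_norm(s, F, q):
    """Brute-force evaluation of max over nonempty A of ||s_A||_q / F(A)^(1/q)."""
    d = len(s)
    best = 0.0
    for r in range(1, d + 1):
        for A in combinations(range(d), r):
            num = sum(abs(s[i]) ** q for i in A) ** (1.0 / q)
            best = max(best, num / F(frozenset(A)) ** (1.0 / q))
    return best

# Assumed example: concave function of a positive modular function,
# which is submodular and non-decreasing.
weights = [1.0, 2.0, 0.5, 1.5]
F = lambda A: math.sqrt(sum(weights[i] for i in A))

p = 3.0
q = p / (p - 1.0)                       # conjugate exponent, 1/p + 1/q = 1
s = [0.7, -1.2, 0.4, 2.0]
d = len(s)

m = min(F(frozenset({k})) for k in range(d))
M = max(F(frozenset({k})) for k in range(d))

lower = max(abs(x) for x in s) / M ** (1.0 / q)
singleton = max(abs(s[k]) / F(frozenset({k})) ** (1.0 / q) for k in range(d))
value = dual_norm(s, F, q)
upper = sum(abs(x) ** q for x in s) ** (1.0 / q) / m ** (1.0 / q)
```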



C.2 Some optimality conditions for $\eta$.

While exact necessary and sufficient conditions for $\eta$ to be a solution of Eq. (6) would be tedious
to formulate precisely, we provide three necessary and two sufficient conditions, which together
characterize a non-trivial subset of the solutions and will be useful in the subsequent analysis.

Proposition 8 (Optimality conditions for $\eta$). Let $F$ be a non-decreasing submodular function. Let
$p > 1$ and $w \in \mathbb{R}^d$, $K = \mathrm{Supp}(w)$ and $J$ the smallest stable set containing $K$. Let $H(w)$ be the set of
minimizers of Eq. (6). Then,

(a) the set $\{\eta_K, \; \eta \in H(w)\}$ is a singleton with strictly positive components, which we denote
$\{\eta_K(w)\}$, i.e., Eq. (6) uniquely determines $\eta_K$.

(b) For all $\eta \in H(w)$, $\eta_{J^c} = 0$.

(c) If $A_1 \cup \cdots \cup A_m$ are the ordered level sets of $\eta_K$, i.e., $\eta$ is constant on each $A_j$ and the values
on the $A_j$ form a strictly decreasing sequence, then $F(A_1 \cup \cdots \cup A_j) - F(A_1 \cup \cdots \cup A_{j-1}) > 0$ and the
value on $A_j$ is equal to

$$\eta_{A_j}(w) = \frac{\|w_{A_j}\|_p}{[F(A_1 \cup \cdots \cup A_j) - F(A_1 \cup \cdots \cup A_{j-1})]^{1/p}}.$$

(d) If $\eta_K$ is equal to $\eta_K(w)$, $\max_{k \in J \setminus K} \eta_k \le \min_{k \in K} \eta_k(w)$, and $\eta_{J^c} = 0$, then $\eta \in H(w)$.

(e) There exists $\eta \in H(w)$ such that $\frac{\min_{i \in K} |w_i|}{M^{1/p}} \le \min_{j \in J} \eta_j \le \max_{j \in J} \eta_j \le \frac{\|w\|_p}{\tilde m^{1/p}}$.

Proof. (a) Since $f$ is non-decreasing with respect to each of its arguments, for any $\eta \in H(w)$, we have
$\eta' \in H(w)$ for $\eta'$ defined through $\eta'_K = \eta_K$ and $\eta'_{K^c} = 0$. The set of values of $\eta_K$ for $\eta \in H(w)$ is
therefore the set of solutions of problem (6) restricted to $K$. The latter problem has a unique solution
as a consequence of the strict convexity on $\mathbb{R}_+^*$ of $\eta_j \mapsto |w_j|^p / \eta_j^{p-1}$.

(b) If there is $j \in J^c$ such that $\eta \in H(w)$ and $\eta_j \neq 0$, then (since $w_j = 0$) because $f$ is
non-decreasing with respect to each of its arguments, we may take $\eta_j$ infinitesimally small and all other
$\eta_k$ for $k \in K^c$ equal to zero, and we have $f(\eta) = f_K(\eta_K(w)) + \eta_j [F(K \cup \{j\}) - F(K)]$. Since
$F(K \cup \{j\}) - F(K) \ge F(J \cup \{j\}) - F(J) > 0$ (because $J$ is stable), we have $f(\eta) > f_K(\eta_K(w))$,
which is a contradiction.

(c) Given the ordered level sets, we have $f(\eta) = \sum_{j=1}^m \eta_{A_j} [F(A_1 \cup \cdots \cup A_j) - F(A_1 \cup \cdots \cup A_{j-1})]$,
which leads to the closed-form expression $\eta_{A_j}(w) = \|w_{A_j}\|_p / [F(A_1 \cup \cdots \cup A_j) - F(A_1 \cup \cdots \cup A_{j-1})]^{1/p}$. If $F(A_1 \cup \cdots \cup A_j) - F(A_1 \cup \cdots \cup A_{j-1}) = 0$,
then since $\|w_{A_j}\|_p > 0$, we have $\eta_{A_j}$ as large as possible, i.e., it has to be
equal to $\eta_{A_{j-1}}$, thus it is not a possible ordered partition.

(d) With our particular choice for $\eta$, we have $\frac{1}{p} \sum_{i \in V} \frac{|w_i|^p}{\eta_i^{p-1}} + \frac{1}{q} f(\eta) = \Omega_K(w_K)$. Since we always
have $\Omega(w) \ge \Omega_K(w_K)$, then $\eta$ is optimal in Eq. (6).

(e) We take the largest elements from (d) and bound the components of $\eta_K$ using (c). □

Note that from property (c), we can make the value of the norm explicit as:

$$\Omega_p(w) = \sum_{j=1}^{m} \big( F(A_1 \cup \cdots \cup A_j) - F(A_1 \cup \cdots \cup A_{j-1}) \big)^{1/q} \, \|w_{A_j}\|_p \quad (13)$$

where $\Omega_{A,B}$ is the norm associated with the contraction on $A$ of $F$ restricted to $B$.
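
The closed form in (c) and the corresponding term of Eq. (13) can be checked numerically on a single level set. The sketch below assumes that, on a level set with common value $\eta$, the objective of Eq. (6) reduces to $\frac{1}{p} a^p \eta^{1-p} + \frac{1}{q} \eta \, \Delta F$ with $a = \|w_{A_j}\|_p$ and $\Delta F$ the increment of $F$; this form is an assumption inferred from the surrounding expressions, since Eq. (6) itself is not restated here. The function names and numerical values are illustrative.

```python
import numpy as np

def level_set_solution(a, delta_F, p):
    """Closed-form minimiser a / delta_F^(1/p) and minimum delta_F^(1/q) * a."""
    q = p / (p - 1.0)
    eta = a / delta_F ** (1.0 / p)
    return eta, delta_F ** (1.0 / q) * a

def objective(eta, a, delta_F, p):
    """Assumed per-level-set objective: a^p / (p eta^(p-1)) + eta * delta_F / q."""
    q = p / (p - 1.0)
    return a ** p / (p * eta ** (p - 1)) + eta * delta_F / q

a, delta_F, p = 2.0, 0.5, 3.0
eta_star, val_star = level_set_solution(a, delta_F, p)

# Compare with a fine grid search over eta > 0.
grid = np.linspace(0.05, 20.0, 200001)
grid_min = float(objective(grid, a, delta_F, p).min())
val_at_star = float(objective(eta_star, a, delta_F, p))
```

Under this assumption, summing the minimum value over the level sets reproduces the right-hand side of Eq. (13).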



D Proof of Proposition 4 (Decomposability) 

Concretely, let $c = \tilde m / M$ with $M = \max_{k \in V} F(\{k\})$ and

$$\tilde m = \min_{A, k} \; F(A \cup \{k\}) - F(A) \quad \text{s.t.} \quad F(A \cup \{k\}) > F(A).$$

Proposition (4. Weak and local Decomposability). (a) For any set $J$ and any $w \in \mathbb{R}^d$, we have

$$\Omega(w) \ge \Omega_J(w_J) + \Omega^J(w_{J^c}).$$

(b) Assume that $J$ is stable, and $\|w_{J^c}\|_p \le c^{1/p} \min_{i \in J} |w_i|$, then $\Omega(w) = \Omega_J(w_J) + \Omega^J(w_{J^c})$.

(c) Assume that $K$ is non stable and $J$ is the smallest stable set containing $K$, and that
$\|w_{J^c}\|_p \le c^{1/p} \min_{i \in K} |w_i|$, then $\Omega(w) = \Omega_J(w_J) + \Omega^J(w_{J^c})$.

Proof. We first prove the first statement (a): If $\|s_{A \cap J}\|_q^q \le F(A \cap J)$ and $\|s_{A \cap J^c}\|_q^q \le F(A \cup J) - F(J)$,
then by submodularity we have $\|s_A\|_q^q \le F(A \cap J) + F(A \cup J) - F(J) \le F(A)$. The submodular
polyhedra associated with $F_J$ and $F^J$ are respectively defined by

$$P(F_J) = \{s \in \mathbb{R}^d, \; \mathrm{Supp}(s) \subset J, \; s(A) \le F(A), \; A \subset J\} \quad \text{and}$$
$$P(F^J) = \{s \in \mathbb{R}^d, \; \mathrm{Supp}(s) \subset J^c, \; s(A) \le F(A \cup J) - F(J)\}.$$

Denoting $s^{\circ q} := (s_1^q, \ldots, s_d^q)$, we therefore have

$$\Omega(w) = \max_{|s|^{\circ q} \in P(F)} s^\top w \;\ge\; \max_{|s_J|^{\circ q} \in P(F_J), \; |s_{J^c}|^{\circ q} \in P(F^J)} s^\top w = \Omega_J(w_J) + \Omega^J(w_{J^c}).$$

In order to prove (b), we consider an optimal $\eta_J$ for $w_J$ and $\Omega_J$ and an optimal $\eta_{J^c}$ for $\Omega^J$. Because
of our inequalities, and because we have assumed that $J$ is stable (so that the value $m$ for $F^J$ is
indeed lower bounded by $\tilde m$), we have $\|\eta_{J^c}\|_\infty \le \|w_{J^c}\|_p / \tilde m^{1/p}$. Moreover, we have $\min_{j \in J} \eta_j \ge \min_{i \in J} |w_i| / M^{1/p}$
(inequality proved in the main paper). Thus when concatenating $\eta_J$ and $\eta_{J^c}$ we obtain an optimal $\eta$
for $w$ (since then the Lovász extension decomposes as a sum of two terms), hence the desired result.

In order to prove (c), we simply notice that since $F(J) = F(K)$, the value of $\eta_{J \setminus K}$ is irrelevant (the
variational formulation does not depend on it), and we may take it equal to the largest known possible
value, i.e., one which is larger than $\min_{i \in K} |w_i| / M^{1/p}$, and the same reasoning as for (b) applies. □

Note that when $p = \infty$, the condition in (b) becomes $\min_{i \in J} |w_i| \ge \max_{i \in J^c} |w_i|$, and we recover
exactly the corresponding result from Bach (2010).



E Algorithmic results 
E.1 Proof of Algorithm 1

Algorithm 1 is a particular instance of the decomposition algorithm for the optimization of a convex
function over the submodular polyhedron (see e.g. section 6.1 of Bach (2011)). Indeed, denoting
$\psi_i(\kappa_i) = \min_{w_i \in \mathbb{R}} \frac{1}{2}(w_i - z_i)^2 + \lambda \kappa_i^{1/q} |w_i|$, the computation of the proximal operator amounts to
solving in $\kappa$ the problem

$$\max_{\kappa \in \mathbb{R}_+^d \cap P(F)} \; \sum_{i \in V} \psi_i(\kappa_i).$$

Following the decomposition algorithm, one has to solve first

$$\max_{\kappa \in \mathbb{R}_+^d} \sum_{i \in V} \psi_i(\kappa_i) \quad \text{s.t.} \quad \sum_{i \in V} \kappa_i = F(V)$$
$$= \min_{w \in \mathbb{R}^d} \max_{\kappa \in \mathbb{R}_+^d} \frac{1}{2} \|w - z\|_2^2 + \lambda \sum_{i \in V} \kappa_i^{1/q} |w_i| \quad \text{s.t.} \quad \sum_{i \in V} \kappa_i = F(V)$$
$$= \min_{w \in \mathbb{R}^d} \frac{1}{2} \|w - z\|_2^2 + \lambda F(V)^{1/q} \|w\|_p,$$

where the last equation is obtained by solving the maximization problem in $\kappa$, which has the unique
solution $\kappa_i = F(V) \frac{|w_i|^p}{\|w\|_p^p}$ if $w \neq 0$ and the simplex of solutions $\{\kappa \in \mathbb{R}_+^d \mid \kappa(V) = F(V)\}$ for $w = 0$.

This is solved in closed form for $p = 2$ with $w^* = \big(\|z\|_2 - \lambda \sqrt{F(V)}\big)_+ \frac{z}{\|z\|_2}$ if $z \neq 0$ and $w^* = 0$ otherwise.

In particular, since $w^* \propto z$, $\kappa_i = F(V) \frac{|z_i|^p}{\|z\|_p^p}$ is always a solution. Following the decomposition
algorithm, one then has to find the minimizer of the submodular function $A \mapsto F(A) - \kappa(A)$. Then
one needs to solve

$$\max_{\kappa_A \in \mathbb{R}_+^{|A|} \cap P(F_A)} \sum_{i \in A} \psi_i(\kappa_i) \quad \text{and} \quad \max_{\kappa_{V \setminus A} \in \mathbb{R}_+^{|V \setminus A|} \cap P(F^A)} \sum_{i \in V \setminus A} \psi_i(\kappa_i).$$

Using the expression of $\psi_i$ and exchanging as above the minimization in $w$ and the maximization in
$\kappa$, one obtains directly that these two problems correspond respectively to the computation of the
proximal operator of $\Omega^{F_A}$ on $z_A$ and of the proximal operator of $\Omega^{F^A}$ on $z_{V \setminus A}$.

The decomposition algorithm is proved to be correct in section 6.1 of Bach (2011) under the
assumption that $\kappa_i \mapsto \psi_i(\kappa_i)$ is a strictly convex function. The functions we consider here are not strongly
convex, and in particular, as mentioned above, the solution in $\kappa$ is not unique in case $w^* = 0$. The
proof of Bach (2011) however goes through using any solution of the maximization problem in $\kappa$.
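
The closed-form solution for $p = 2$ given in this proof is a group soft-thresholding step with threshold $\lambda \sqrt{F(V)}$. A minimal sketch follows; the function name `prox_scaled_l2` and the numerical values are illustrative, and the scalar `F_V` stands for $F(V)$.

```python
import numpy as np

def prox_scaled_l2(z, lam, F_V):
    """Minimise 0.5 * ||w - z||_2^2 + lam * sqrt(F_V) * ||w||_2 over w."""
    z = np.asarray(z, dtype=float)
    norm_z = np.linalg.norm(z)
    if norm_z == 0.0:
        return np.zeros_like(z)          # w* = 0 when z = 0
    # Shrink the norm of z by lam * sqrt(F_V), clipping at zero.
    shrink = max(norm_z - lam * np.sqrt(F_V), 0.0)
    return shrink * z / norm_z

w_small_lam = prox_scaled_l2([3.0, 4.0], lam=1.0, F_V=4.0)   # threshold 2 < ||z|| = 5
w_large_lam = prox_scaled_l2([3.0, 4.0], lam=3.0, F_V=4.0)   # threshold 6 > ||z|| = 5
```

In the first call the solution keeps the direction of `z` with its norm reduced from 5 to 3; in the second the threshold exceeds the norm of `z` and the solution is zero.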

E.2 Decomposition algorithm to compute the norm 

Applying Algorithm 1 in the special case where $\lambda = 0$ yields a decomposition algorithm to compute
the norm itself (see Algorithm 2).

Algorithm 2 Computation of $\Omega_p^F(z)$

Require: $z \in \mathbb{R}^d$.
 1: Let $A = \{j \mid z_j \neq 0\}$.
 2: if $A \neq V$ then
 3:     return $\Omega_p^{F_A}(z_A)$
 4: end if
 5: Let $t \in \mathbb{R}^d$ with $t_i = \frac{|z_i|^p}{\|z\|_p^p} F(V)$
 6: Find $A$ minimizing the submodular function $F - t$
 7: if $A = V$ then
 8:     return $F(V)^{1/q} \|z\|_p$
 9: else
10:     return $\Omega_p^{F_A}(z_A) + \Omega_p^{F^A}(z_{A^c})$
11: end if
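
The recursion of Algorithm 2 can be sketched for small problems as follows. This is an illustration under stated assumptions, not the authors' implementation: the submodular minimization of step 6 is replaced by brute-force enumeration of proper nonempty subsets (exponential in $d$), ties with the trivial minimizers $\emptyset$ and $V$ are resolved with a small tolerance, $p$ is assumed finite and greater than 1, and the name `omega` and the representation of $F$ as a Python callable on frozensets are hypothetical.

```python
from itertools import combinations

def omega(F, z, p, ground=None):
    """Compute Omega_p^F(z) by the decomposition recursion (small d only)."""
    if ground is None:
        ground = frozenset(range(len(z)))
    q = p / (p - 1.0)
    # Steps 1-4: restrict F to the support of z.
    support = frozenset(j for j in ground if z[j] != 0)
    if not support:
        return 0.0
    if support != ground:
        return omega(F, z, p, support)
    # Step 5: t_i = |z_i|^p / ||z||_p^p * F(V).
    norm_pp = sum(abs(z[j]) ** p for j in ground)
    F_V = F(ground)
    t = {j: abs(z[j]) ** p / norm_pp * F_V for j in ground}
    # Step 6 (brute force): minimise F - t over proper nonempty subsets.
    best_val, best_A = 0.0, None
    items = sorted(ground)
    for r in range(1, len(items)):
        for A in combinations(items, r):
            val = F(frozenset(A)) - sum(t[j] for j in A)
            if val < best_val - 1e-12:
                best_val, best_A = val, frozenset(A)
    # Steps 7-8: no subset is violated, so t is feasible.
    if best_A is None:
        return F_V ** (1.0 / q) * norm_pp ** (1.0 / p)
    # Steps 9-10: recurse on the restriction to A and the contraction on A.
    A, F_A = best_A, F(best_A)
    contracted = lambda B, A=A, F_A=F_A, F=F: F(B | A) - F_A
    return omega(F, z, p, A) + omega(contracted, z, p, ground - A)

# Cardinality function: the norm is the l1-norm.
card = lambda A: float(len(A))
val_l1 = omega(card, [1.0, -2.0, 3.0], 2.0)

# Number of groups of a partition hit by A: the norm is the l1/l2 group norm.
groups = [frozenset({0, 1}), frozenset({2, 3})]
cover = lambda A: float(sum(1 for G in groups if A & G))
val_group = omega(cover, [3.0, 4.0, 1.0, 0.0], 2.0)

# F = 1 on nonempty sets: the norm is the lp-norm.
const = lambda A: 1.0 if A else 0.0
val_l2 = omega(const, [1.0, 2.0, 2.0], 2.0)
```

The three examples use set-functions whose relaxations are known in closed form, which gives a simple way to validate the recursion before applying it to other functions.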



F Theoretical Results 

In this section, we prove the propositions on consistency, support recovery and the concentration
result of Section 6.5. As there, we consider a fixed design matrix $X \in \mathbb{R}^{n \times p}$ and $y \in \mathbb{R}^n$ a vector of
random responses. Given $\lambda > 0$, we define $\hat w$ as a minimizer of the regularized least-squares cost:

$$\min_{w \in \mathbb{R}^d} \frac{1}{2n} \|y - Xw\|_2^2 + \lambda \Omega(w). \quad (15)$$

F.1 Proof of Proposition 5 (Support recovery)

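Proof. We follow the proof of the case $p = \infty$ from Bach (2010). Let $r = \frac{1}{n} X^\top \varepsilon \in \mathbb{R}^d$, which is
normal with mean zero and covariance matrix $\sigma^2 Q / n$. We have for any $w \in \mathbb{R}^p$,

$$\Omega(w) \ge \Omega_J(w_J) + \Omega^J(w_{J^c}) \ge \Omega_J(w_J) + \rho \, \Omega_{J^c}(w_{J^c}) \ge \rho \, \Omega(w).$$

This implies that $\Omega^*(r) \ge \rho \max\{\Omega_J^*(r_J), (\Omega^J)^*(r_{J^c})\}$.
Moreover, $r_{J^c} - Q_{J^c J} Q_{JJ}^{-1} r_J$ is normal with covariance matrix

$$\frac{\sigma^2}{n} \big( Q_{J^c J^c} - Q_{J^c J} Q_{JJ}^{-1} Q_{J J^c} \big) \preceq \frac{\sigma^2}{n} Q_{J^c J^c}.$$

This implies that with probability larger than $1 - 3 P(\Omega^*(r) > \lambda \rho \eta / 2)$, we have
$\Omega_J^*(r_J) \le \lambda / 2$ and $(\Omega^J)^*(r_{J^c} - Q_{J^c J} Q_{JJ}^{-1} r_J) \le \lambda \eta / 2$.

We denote by $\tilde w$ the unique (because $Q_{JJ}$ is invertible) minimum of $\frac{1}{2n} \|y - Xw\|_2^2 + \lambda \Omega(w)$, subject
to $w_{J^c} = 0$. $\tilde w_J$ is defined through $Q_{JJ}(\tilde w_J - w_J^*) - r_J = -\lambda s_J$ where $s_J \in \partial \Omega_J(\tilde w_J)$ (which implies
that $\Omega_J^*(s_J) \le 1$), i.e., $\tilde w_J - w_J^* = Q_{JJ}^{-1}(r_J - \lambda s_J)$. We have:

$$\|\tilde w_J - w_J^*\|_\infty \le \max_{j \in J} |\delta_j^\top Q_{JJ}^{-1} (r_J - \lambda s_J)|$$
$$\le \max_{j \in J} \Omega_J(Q_{JJ}^{-1} \delta_j) \, \Omega_J^*(r_J - \lambda s_J)$$
$$\le \max_{j \in J} \|Q_{JJ}^{-1} \delta_j\|_p \, F(J)^{1 - 1/p} \, [\Omega_J^*(r_J) + \lambda \Omega_J^*(s_J)]$$
$$\le \kappa^{-1} |J|^{1/p} F(J)^{1 - 1/p} \, [\Omega_J^*(r_J) + \lambda \Omega_J^*(s_J)] \le \frac{3}{2} \lambda |J|^{1/p} F(J)^{1 - 1/p} \kappa^{-1}.$$

Thus if $2 \lambda |J|^{1/p} F(J)^{1 - 1/p} \kappa^{-1} \le \nu$, then $\|\tilde w - w^*\|_\infty \le \frac{3\nu}{4}$, which implies $\mathrm{Supp}(\tilde w) \supset \mathrm{Supp}(w^*)$.

In the neighborhood of $\tilde w$, we have an exact decomposition of the norm; hence, to show that $\tilde w$ is the
unique global minimum, we simply need to show that since we have $(\Omega^J)^*(r_{J^c} - Q_{J^c J} Q_{JJ}^{-1} r_J) \le \lambda \eta / 2$,
$\tilde w$ is the unique minimizer of Eq. (10). For that it suffices to show that $(\Omega^J)^*(Q_{J^c J}(\tilde w_J - w_J^*) - r_{J^c}) < \lambda$.
We have:

$$(\Omega^J)^*(Q_{J^c J}(\tilde w_J - w_J^*) - r_{J^c}) = (\Omega^J)^*(Q_{J^c J} Q_{JJ}^{-1} (r_J - \lambda s_J) - r_{J^c})$$
$$\le (\Omega^J)^*(Q_{J^c J} Q_{JJ}^{-1} r_J - r_{J^c}) + \lambda (\Omega^J)^*(Q_{J^c J} Q_{JJ}^{-1} s_J)$$
$$\le (\Omega^J)^*(Q_{J^c J} Q_{JJ}^{-1} r_J - r_{J^c}) + \lambda (\Omega^J)^*\big[ (\Omega_J(Q_{JJ}^{-1} Q_{Jj}))_{j \in J^c} \big]$$
$$\le \lambda \eta / 2 + \lambda (1 - \eta) < \lambda,$$

which leads to the desired result. □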



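F.2 Proof of Proposition 6 (Consistency)

Proof. As in the proof of Proposition 5, we have

$$\Omega(x) \ge \Omega_J(x_J) + \Omega^J(x_{J^c}) \ge \Omega_J(x_J) + \rho \, \Omega_{J^c}(x_{J^c}) \ge \rho \, \Omega(x).$$

Thus, if we assume $\Omega^*(q) \le \lambda \rho / 2$, then $\Omega_J^*(q_J) \le \lambda / 2$ and $(\Omega^J)^*(q_{J^c}) \le \lambda / 2$. Let $\Delta = \hat w - w^*$.

We follow the proof from Bickel et al. (2009) by using the decomposition property of the norm $\Omega$.
We have, by optimality of $\hat w$:

$$\frac{1}{2} \Delta^\top Q \Delta + \lambda \Omega(w^* + \Delta) + q^\top \Delta \le \lambda \Omega(w^*), \quad \text{and therefore} \quad \lambda \Omega(w^* + \Delta) + q^\top \Delta \le \lambda \Omega(w^*).$$

Using the decomposition property,

$$\lambda \Omega_J((w^* + \Delta)_J) + \lambda \Omega^J((w^* + \Delta)_{J^c}) + q_J^\top \Delta_J + q_{J^c}^\top \Delta_{J^c} \le \lambda \Omega_J(w_J^*),$$
$$\lambda \Omega^J(\Delta_{J^c}) \le \lambda \Omega_J(w_J^*) - \lambda \Omega_J(w_J^* + \Delta_J) + \Omega_J^*(q_J) \Omega_J(\Delta_J) + (\Omega^J)^*(q_{J^c}) \Omega^J(\Delta_{J^c}), \quad \text{and}$$
$$(\lambda - (\Omega^J)^*(q_{J^c})) \, \Omega^J(\Delta_{J^c}) \le (\lambda + \Omega_J^*(q_J)) \, \Omega_J(\Delta_J).$$

Thus $\Omega^J(\Delta_{J^c}) \le 3 \Omega_J(\Delta_J)$, which implies $\Delta^\top Q \Delta \ge \kappa \|\Delta_J\|_2^2$ (by our assumption which generalizes
the usual $\ell_1$-restricted eigenvalue condition). Moreover, we have:

$$\Delta^\top Q \Delta = \Delta^\top (Q \Delta) \le \Omega(\Delta) \, \Omega^*(Q \Delta)$$
$$\le \Omega(\Delta) (\Omega^*(q) + \lambda) \le \frac{3\lambda}{2} \Omega(\Delta) \quad \text{by optimality of } \hat w,$$
$$\Omega(\Delta) \le \Omega_J(\Delta_J) + \rho^{-1} \Omega^J(\Delta_{J^c})$$
$$\le \Omega_J(\Delta_J) \Big( 1 + \frac{3}{\rho} \Big) \le \frac{4}{\rho} \Omega_J(\Delta_J).$$

This implies that $\kappa \, \Omega_J(\Delta_J)^2 \le \Delta^\top Q \Delta \le \frac{6\lambda}{\rho} \Omega_J(\Delta_J)$, and thus $\Omega_J(\Delta_J) \le \frac{6\lambda}{\kappa \rho}$, which leads to the
desired result, given the previous inequalities.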

□ 



F.3 Proof of Proposition 7

Proof. We have $\Omega^*(z) = \max_{A \in \mathcal{D}_F} \frac{\|z_A\|_q}{F(A)^{1/q}}$. Thus, from the union bound, we get

$$P(\Omega^*(z) > t) \le \sum_{A \in \mathcal{D}_F} P(\|z_A\|_q^q > t^q F(A)).$$

We can then derive concentration inequalities. We have $E\|z_A\|_q \le (E\|z_A\|_q^q)^{1/q} = (|A| \, E|\varepsilon|^q)^{1/q} \le 2 |A|^{1/q} q^{1/2}$,
where $\varepsilon$ is a standard normal random variable. Moreover, $\|z_A\|_q \le \|z_A\|_2$ for $q \ge 2$,
and $\|z_A\|_q \le |A|^{1/q - 1/2} \|z_A\|_2$ for $q \le 2$. We can thus use the concentration of Lipschitz-continuous
functions of Gaussian variables, to get for $p \le 2$ (i.e., $q \ge 2$) and $u \ge 0$,

$$P(\|z_A\|_q \ge 2 |A|^{1/q} \sqrt{q} + u) \le e^{-u^2/2}.$$

For $p > 2$ (i.e., $q < 2$), we obtain

$$P(\|z_A\|_q \ge 2 |A|^{1/q} \sqrt{q} + u) \le e^{-u^2 |A|^{1 - 2/q} / 2}.$$

We can also bound the expected norm $E[\Omega^*(z)]$, as

$$E[\Omega^*(z)] \le 4 \sqrt{q \log(2 |\mathcal{D}_F|)} \, \max_{A \in \mathcal{D}_F} \frac{|A|^{1/q}}{F(A)^{1/q}}.$$

Together with $\Omega^*(z) \le \|z\|_2 \max_{A \in \mathcal{D}_F} \frac{|A|^{(1/q - 1/2)_+}}{F(A)^{1/q}}$, we get

$$P\Big( \Omega^*(z) \ge 4 \sqrt{q \log(2 |\mathcal{D}_F|)} \, \max_{A \in \mathcal{D}_F} \frac{|A|^{1/q}}{F(A)^{1/q}} + u \max_{A \in \mathcal{D}_F} \frac{|A|^{(1/q - 1/2)_+}}{F(A)^{1/q}} \Big) \le e^{-u^2/2}.$$

□